Module 4: Algorithms for Data Science

Linear Regression
Linear regression is a linear modelling approach to find the relationship between one or more independent variables (predictors), denoted as X, and the dependent variable (target), denoted as Y, using a linear expression. It is called simple linear regression when there is only one X, i.e. only one independent variable. Linear regression is all about finding the best fit line.
Linear regression is done in an iterative manner. First a random line is drawn, and the distance of all the points from this line is calculated. To ensure that there are no negative values, each distance is squared: the square of the distance of every point from the line is taken and added up, and it is made sure that this sum, which is known as the sum of squared errors, is the minimum. It is an iterative process; the sum of the squared errors must be minimum.
Understanding the algorithm:
The independent variable is plotted on the x-axis and the dependent variable on the y-axis. If the data points are increasing, the regression line we get will have a +ve slope; if the data points are decreasing, a -ve slope. The equation of the line is y = mx + c, where c is the y-intercept of the line and m is the slope of the line. Out of all candidate lines, the one which has the least error would be the regression line.
Mathematical implementation:

x (independent): 1, 2, 3, 4, 5
y (dependent):   3, 4, 2, 4, 5

Mean of x = 3, mean of y = 3.6. So we found a point (3, 3.6) that the regression line must pass through.

According to the equation of our regression line, y = mx + c, we now have to find m and c.

m = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)²,  where x̄ = mean of x = 3 and ȳ = mean of y = 3.6.

(x - x̄) is the distance of all the x values from the mean, and (y - ȳ) is the distance of all the y values from the mean:

x | y | x - x̄ | y - ȳ | (x - x̄)(y - ȳ) | (x - x̄)²
1 | 3 |  -2   | -0.6  |      1.2        |    4
2 | 4 |  -1   |  0.4  |     -0.4        |    1
3 | 2 |   0   | -1.6  |      0          |    0
4 | 4 |   1   |  0.4  |      0.4        |    1
5 | 5 |   2   |  1.4  |      2.8        |    4

m = 4 / 10 = 0.4

Let us put the point (3, 3.6) in the equation y = mx + c:
3.6 = 0.4 × 3 + c
c = 3.6 - 1.2
c = 2.4

So the regression line is y = 0.4x + 2.4.

The formula for R² is

R² = Σ(ŷ - ȳ)² / Σ(y - ȳ)²,  where y is the actual value, ŷ is the predicted value on the line and ȳ is the mean.

y | y - ȳ | (y - ȳ)² | ŷ  | ŷ - ȳ | (ŷ - ȳ)²
3 | -0.6  |   0.36   | 2.8 | -0.8  |  0.64
4 |  0.4  |   0.16   | 3.2 | -0.4  |  0.16
2 | -1.6  |   2.56   | 3.6 |  0    |  0
4 |  0.4  |   0.16   | 4.0 |  0.4  |  0.16
5 |  1.4  |   1.96   | 4.4 |  0.8  |  0.64

        Sum = 5.2                 Sum = 1.6

R² = 1.6 / 5.2 ≈ 0.3
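A minimal sketch in plain Python (using only the five (x, y) pairs above) that reproduces the slope, intercept and R² worked out by hand:

```python
# Least-squares fit and R^2 for the worked example above.
x = [1, 2, 3, 4, 5]
y = [3, 4, 2, 4, 5]

n = len(x)
x_mean = sum(x) / n          # 3.0
y_mean = sum(y) / n          # 3.6

# m = sum((x - x_mean)(y - y_mean)) / sum((x - x_mean)^2)
m = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) \
    / sum((xi - x_mean) ** 2 for xi in x)              # 0.4
c = y_mean - m * x_mean                                 # 2.4

y_pred = [m * xi + c for xi in x]                       # 2.8, 3.2, 3.6, 4.0, 4.4

# R^2 = sum((y_hat - y_mean)^2) / sum((y - y_mean)^2)
r2 = sum((yp - y_mean) ** 2 for yp in y_pred) \
     / sum((yi - y_mean) ** 2 for yi in y)              # 1.6 / 5.2 ≈ 0.31

print(m, c, round(r2, 2))                               # 0.4 2.4 0.31
```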
R² = 0.3 is not a good value here: the data points are very far away from the regression line. But is 30% always bad? In some areas, for example psychology (human behaviour), low R-square values are expected, so a low R² is not necessarily a bad model.
Logistic Regression

Logistic regression is one of the most popular machine learning algorithms for binary classification. It is a supervised ML algorithm. What does logistic regression predict? Categorical values, such as Yes/No or True/False.
Instead of a best fit straight line, logistic regression fits an S-shaped sigmoid curve. The prediction is read against a threshold of 0.5: for the tumour example, outputs above the threshold are classified as Malignant (y = 1) and outputs below it as Benign (y = 0).

h(x) = 1 / (1 + e^-(b0 + b1·x)), where x is the independent variable, b1 is the slope coefficient and b0 is the intercept; the condition h(x) ≥ 0.5 gives the predicted class.
Ex: Suppose a tumour of size 3.5 in the training dataset. If we increase the training data with a few very large tumours (outliers), the best fit straight line shifts and the predictions change; moreover, a straight line cannot act as a saturation for this prediction problem, where the output must stay between 0 and 1. So linear regression cannot be used here, and that is where logistic regression comes into the picture.
Logistic regression has 3 steps:
1. Calculate the logistic function of the X and Y variables and plot the graph using the sigmoid function.
2. Learn the coefficients, and later update them using stochastic gradient descent (SGD).
3. Then, using the model, make the prediction.
Q: Dataset - a training set of tumour data with two input features X1, X2 (e.g. length and width) and the class Y (0 = Benign, 1 = Malignant); the first record is X1 = 2.78, X2 = 2.55, Y = 0.

The sigmoid equation above helps to transform any input value to a target class value between 0 and 1.

Step: we need to find the model parameters B0, B1, B2 using the training set input. Initially we assume B0 = 0.0, B1 = 0.0, B2 = 0.0; they will then be calculated (updated) using SGD (explained in class).

For the first training instance, X1 = 2.78, X2 = 2.55, Y = 0:

prediction = 1 / (1 + e^-(B0 + B1·X1 + B2·X2)) = 1 / (1 + e^-(0.0 + 0.0×2.78 + 0.0×2.55)) = 0.5

Applying the SGD update rule to this instance gives the updated coefficient values b0 = -0.0375, b1 ≈ -0.104, b2 ≈ -0.095. The result of the next prediction, 1 / (1 + e^-(b0 + b1·x1 + b2·x2)), is then computed with these updated coefficients, and the process is repeated over the training set.
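A small sketch of that first SGD update, assuming the standard per-example rule b_i ← b_i + α·(y − p)·p·(1 − p)·x_i; the learning rate α is not written in the notes, but 0.3 is the value that reproduces b0 = -0.0375:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# First training instance (values as read from the notes) and zero-initialised coefficients.
x1, x2, y = 2.78, 2.55, 0
b0 = b1 = b2 = 0.0
alpha = 0.3   # assumed learning rate (not stated in the notes; 0.3 reproduces b0 = -0.0375)

p = sigmoid(b0 + b1 * x1 + b2 * x2)   # 0.5, since all coefficients start at 0

# One stochastic-gradient-descent update per coefficient.
step = alpha * (y - p) * p * (1 - p)
b0 += step            # -0.0375
b1 += step * x1       # about -0.104
b2 += step * x2       # about -0.096
print(round(p, 3), round(b0, 4), round(b1, 4), round(b2, 4))
```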
K-means Clustering Algorithm

The k-means clustering algorithm is used to divide the dataset into k clusters. Let us use k = 2 on this dataset of points:

a1(1,1), a2(2,1), a3(2,3), a4(3,2), a5(4,3), a6(5,5)

Step 1: Selecting the cluster centers randomly: P1 = (2,1) and P2 = (2,3).

Step 2: Finding the distance between each data point and the cluster centers, and assigning every point to the nearer center (check the table and pick the least distance for each point). For example, for a6(5,5): distance to P1(2,1) = sqrt((5-2)² + (5-1)²) = 5 and distance to P2(2,3) = 3.61, so a6 goes to P2. Point a4(3,2) is equidistant from both centers and is assigned to P1. This gives:

Cluster P1: a1(1,1), a2(2,1), a4(3,2)        Cluster P2: a3(2,3), a5(4,3), a6(5,5)

Step 3: Recompute each center as the mean of the points in its cluster:

P1 = ((1,1) + (2,1) + (3,2)) / 3 = (6,4)/3 = (2, 1.33)
P2 = ((2,3) + (4,3) + (5,5)) / 3 = (11,11)/3 = (3.67, 3.67)

Repeat Step 2 until we get the same cluster elements as in the previous iteration.

Distance table for the new centers:

Data point | Distance from P1(2, 1.33) | Distance from P2(3.67, 3.67) | Assigned center
a1 (1,1)   |          1.05             |            3.78              |       P1
a2 (2,1)   |          0.33             |            3.15              |       P1
a3 (2,3)   |          1.67             |            1.80              |       P1
a4 (3,2)   |          1.20             |            1.80              |       P1
a5 (4,3)   |          2.61             |            0.75              |       P2
a6 (5,5)   |          4.74             |            1.88              |       P2

As per the rule, the previous and present cluster elements are different (a3 has moved to P1), so new cluster centers have to be formed:

P1 = ((1,1) + (2,1) + (2,3) + (3,2)) / 4 = (8,7)/4 = (2, 1.75)
P2 = ((4,3) + (5,5)) / 2 = (9,8)/2 = (4.5, 4)

The new cluster centers are P1 = (2, 1.75) and P2 = (4.5, 4). Even the centers are not the same as the previous ones, so the step has to be repeated; the next pass gives the same assignments and the same centers, so the algorithm stops there.
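A short NumPy sketch of the same iteration, using the six points and the two starting centres from the worked example above; ties are broken toward the first centre, as in the hand calculation:

```python
import numpy as np

# Data points a1..a6 and the initial (randomly picked) centres P1, P2.
points = np.array([[1, 1], [2, 1], [2, 3], [3, 2], [4, 3], [5, 5]], dtype=float)
centres = np.array([[2, 1], [2, 3]], dtype=float)

for _ in range(10):                       # k-means converges in a few passes here
    # Step 2: distance of every point to every centre, assign to the nearest.
    d = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
    assign = d.argmin(axis=1)             # argmin keeps the first centre on ties
    # Step 3: recompute each centre as the mean of its assigned points.
    new_centres = np.array([points[assign == k].mean(axis=0) for k in range(2)])
    if np.allclose(new_centres, centres): # stop when the centres no longer move
        break
    centres = new_centres

print(centres)        # [[2.   1.75] [4.5  4.  ]]
print(assign)         # [0 0 0 0 1 1]
```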
K-Nearest Neighbours (KNN)

KNN is based on a similarity measure. Ex: recommender systems, the Wikipedia search algorithm. KNN in general is also used as a search algorithm when we are searching for similar items.

What is K in KNN? K is the number of nearest neighbours considered. When K = 1, a query point is simply given the class of the single closest example in the training set, whatever that example is.

Step: what the KNN algorithm does now is find the distance between Rashmika's data and the data of the other reference people in the table. To find the distance between them it uses the formula

d = sqrt((x2 - x1)² + (y2 - y1)²)

which is called the Euclidean distance. In the table, Kohli likes cricket, Dhoni likes cricket and SRK likes football; since the class most common among Rashmika's K nearest neighbours is the class of people who like cricket, KNN will predict that Rashmika likes cricket.
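A tiny sketch of the idea; the 2-D feature values given to each person below are hypothetical (the notes only name the people and the sports), the point is just the Euclidean-distance and majority-vote mechanics:

```python
import math
from collections import Counter

# Hypothetical reference table: (x, y) feature values and the sport each person likes.
train = [
    ("Kohli", (5.0, 4.5), "cricket"),
    ("Dhoni", (5.5, 4.0), "cricket"),
    ("SRK",   (1.0, 1.5), "football"),
]
rashmika = (4.8, 4.2)   # hypothetical feature values for the query point
k = 3

def euclidean(p, q):
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

# Distance from Rashmika to every reference row, sorted nearest first.
neighbours = sorted(train, key=lambda row: euclidean(row[1], rashmika))[:k]

# Majority class among the k nearest neighbours.
prediction = Counter(label for _, _, label in neighbours).most_common(1)[0][0]
print(prediction)       # cricket
```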
Naive Bayes Classifier Algorithm

It is a supervised machine learning algorithm with a classification technique, based on Bayes' theorem. Given a hypothesis (H) and evidence (E), Bayes' theorem states the relationship between the probability of the hypothesis before getting the evidence, P(H), and the probability of the hypothesis after getting the evidence, P(H|E):

P(H|E) = P(E|H) · P(H) / P(E)

This follows from conditional probability: P(A|B) = P(A∩B)/P(B) and P(B|A) = P(B∩A)/P(A).
Day | Outlook  | Temp | Humidity | Wind   | Play Cricket
 1  | Sunny    | Hot  | High     | Weak   | No
 2  | Sunny    | Hot  | High     | Strong | No
 3  | Overcast | Hot  | High     | Weak   | Yes
 4  | Rain     | Mild | High     | Weak   | Yes
 5  | Rain     | Cool | Normal   | Weak   | Yes
 6  | Rain     | Cool | Normal   | Strong | No
 7  | Overcast | Cool | Normal   | Strong | Yes
 8  | Sunny    | Mild | High     | Weak   | No
 9  | Sunny    | Cool | Normal   | Weak   | Yes
10  | Rain     | Mild | Normal   | Weak   | Yes
11  | Sunny    | Mild | Normal   | Strong | Yes
12  | Overcast | Mild | High     | Strong | Yes
13  | Overcast | Hot  | Normal   | Weak   | Yes
14  | Rain     | Mild | High     | Strong | No
Query: on day 15 the conditions are Outlook = Sunny, Temp = Cool, Humidity = High, Wind = Strong. Total no. of rows = 14. What to find: is Play = Yes or Play = No more probable for these conditions?

Class: total Yes = 9, total No = 5, so P(Yes) = 9/14 and P(No) = 5/14.

Wind (14): Weak (8), Strong (6).  P(Strong|Yes) = 3/9,  P(Strong|No) = 3/5.
Humidity (14): High (7), Normal (7).  P(High|Yes) = 3/9,  P(Normal|Yes) = 6/9,  P(High|No) = 4/5,  P(Normal|No) = 1/5.
Temp (14): P(Cool|Yes) = 3/9,  P(Cool|No) = 1/5.
Outlook (14): P(Sunny|Yes) = 2/9,  P(Sunny|No) = 3/5.

For the No class:
P(No) · P(Sunny|No) · P(Cool|No) · P(High|No) · P(Strong|No) = 5/14 × 3/5 × 1/5 × 4/5 × 3/5 ≈ 0.0206

For the Yes class:
P(Yes) · P(Sunny|Yes) · P(Cool|Yes) · P(High|Yes) · P(Strong|Yes) = 9/14 × 2/9 × 3/9 × 3/9 × 3/9 ≈ 0.0053

Compare the value for Yes with the value for No. Since the value for No is greater than that for Yes, we should not go to play cricket on the 15th day.
Ans: do not play.
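A sketch that recomputes the class-conditional probabilities above directly from the 14-row table:

```python
# Recompute P(class) * product of P(feature value | class) for the day-15 query.
data = [
    ("Sunny", "Hot", "High", "Weak", "No"),       ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),   ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),   ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"), ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"), ("Rain", "Mild", "High", "Strong", "No"),
]
query = ("Sunny", "Cool", "High", "Strong")       # day 15

def score(label):
    rows = [r for r in data if r[-1] == label]
    p = len(rows) / len(data)                     # prior: P(Yes) = 9/14, P(No) = 5/14
    for i, value in enumerate(query):             # naive (independence) assumption
        p *= sum(r[i] == value for r in rows) / len(rows)
    return p

print(round(score("Yes"), 4), round(score("No"), 4))   # 0.0053 0.0206
# P(No|X) > P(Yes|X)  ->  do not play on day 15
```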
Real-time applications of Naive Bayes: e-mail spam filtering, weather forecasting, medical diagnosis, object detection/classification, news categorisation.
Principal Component Analysis (PCA)

PCA derives a new, lower-dimensional dataset that keeps the important information, which also helps when attributes have very different ranges. Ex: in a movie dataset the Rating feature lies in the range 0-5 while the No. of downloads ranges between 1000 and 5000; such attributes are brought to a comparable scale before analysis.

Worked example with two attributes x and y (10 data points):

x: 2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1
y: 2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9

Mean of x = 1.81, mean of y = 1.91.
The formula for covariance is

Cov(x, y) = Σ(x - x̄)(y - ȳ) / (n - 1)

For example, for the first record x = 2.5: (x - x̄) = 0.69 and (x - x̄)² = 0.476; the summation of these squared deviations over all the entries is Σ(x - x̄)² = 5.549, so

Cov(x, x) = 5.549 / (10 - 1) = 0.6165

In the same way Cov(x, y) = 0.6154 and Cov(y, y) = 0.7165, so the covariance matrix is

C = | Cov(x,x)  Cov(x,y) |  =  | 0.6165  0.6154 |
    | Cov(y,x)  Cov(y,y) |     | 0.6154  0.7165 |
Find the eigenvalues and eigenvectors of C. As per the rule, the eigenvalues λ satisfy |C - λI| = 0:

| 0.6165 - λ     0.6154     |
| 0.6154         0.7165 - λ |  = 0

(0.6165 - λ)(0.7165 - λ) - (0.6154)² = 0
λ² - 1.333λ + 0.0630 = 0
λ1 = 0.0490,  λ2 = 1.2840

These are the eigenvalues.
So now we have the eigenvalues. For every eigenvalue, the corresponding eigenvector (X, Y) has to be found from (C - λI)·v = 0, where C is the covariance matrix.

For the eigenvalue λ = 0.0490, substitute into (C - λI)·v = 0 and take the first row:

(0.6165 - 0.0490)·X + 0.6154·Y = 0
0.5675·X = -0.6154·Y
X = -1.0844·Y
The above simplification is done only to find the relationship between X and Y: X = -1.0844·Y. Now the X and Y values have to be normalised to unit length. Let us compute, following the same procedure: take Y = 1, so X = -1.0844. The length of the vector is

sqrt((-1.0844)² + 1²) = sqrt(1.1759 + 1) = sqrt(2.1759) = 1.4751

The computed value 1.4751 will be used to divide the vector:

X = -1.0844 / 1.4751 = -0.735,   Y = 1 / 1.4751 = 0.6778

Now compute the eigenvector for the other eigenvalue, λ = 1.2840. We arrive at the equation

(0.6165 - 1.2840)·X + 0.6154·Y = 0   →   X = 0.9219·Y

Normalising with sqrt(0.9219² + 1) = sqrt(0.8499 + 1) = sqrt(1.8499) = 1.3601:

X = 0.9219 / 1.3601 = 0.6778,   Y = 1 / 1.3601 = 0.735

So the two eigenvectors are
e1 = (-0.735, 0.6778)   for λ = 0.0490
e2 = ( 0.6778, 0.735)   for λ = 1.2840
Deriving the new dataset: the first principal component corresponds to the eigenvector with the largest eigenvalue, here e2 (λ = 1.2840). For each record, take the first value of x minus the mean of x and the value of y minus the mean of y, and multiply this (x - x̄, y - ȳ) pair with the transpose of the chosen eigenvector to arrive at a single value:

PC1 = e2ᵀ · (x - x̄, y - ȳ)

Similarly compute the value for every x and y using the above formula. The new dataset arrived at, PC1 = (P11, P12, ..., P1n), is one-dimensional.
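A NumPy check of this eigen-decomposition, assuming only the covariance matrix written above and the means from the worked example; the signs of the eigenvectors may come out flipped, which is fine, since an eigenvector scaled by -1 points along the same direction:

```python
import numpy as np

# Covariance matrix from the worked example.
C = np.array([[0.6165, 0.6154],
              [0.6154, 0.7165]])

eigenvalues, eigenvectors = np.linalg.eig(C)
print(eigenvalues)     # approx. 0.049 and 1.284 (order not guaranteed)
print(eigenvectors)    # columns approx. (-0.735, 0.678) and (0.678, 0.735), up to sign

# Project a mean-centred point onto the principal component (largest eigenvalue).
pc = eigenvectors[:, eigenvalues.argmax()]
point = np.array([2.5 - 1.81, 2.4 - 1.91])   # first (x, y) pair minus the means
print(float(point @ pc))                      # its single coordinate in the new 1-D dataset
```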
Planes and Hyperplanes

Equation of a line in 2-dimensional space: ax + by + c = 0, which is the same as y = mx + c.

In 3D space the separating object will always be a plane, not a line. The equation of a plane is

ax + by + cz + d = 0

which can be rewritten for one coordinate, e.g. z = -(ax + by + d)/c; this plane lives in the 3D coordinate system.

Q: what is the equivalent of the line/plane in the above equation for higher dimensions?
Ans: in data science the higher-dimensional equivalent of the line/plane is the hyperplane.

What is the requirement of a hyperplane in data science terms? Start from the equation of a line, y = mx + c, i.e. ax + by + c = 0. Renaming x and y as x1, x2 and the coefficients a, b as w1, w2, it can be written as

w1·x1 + w2·x2 + w0 = 0

The general equation for a 3D plane is w1·x1 + w2·x2 + w3·x3 + w0 = 0, and for an n-dimensional space

w1·x1 + w2·x2 + ... + wn·xn + w0 = 0

With w = [w1, w2, ..., wn] and x = [x1, x2, ..., xn] stored as vectors, this can be written as

wᵀx + w0 = 0

What is w0? It plays the role of c in y = mx + c (ax + by + c = 0) for the 2D line. If the hyperplane is passing through the origin, then w0 = 0, so in that case the equation is simply wᵀx = 0.
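A tiny sketch of the vector form: a hyperplane is fully described by (w, w0), and which side a point falls on is just the sign of wᵀx + w0. The numbers below are made up purely for illustration:

```python
import numpy as np

w = np.array([1.0, -2.0, 0.5])   # example weights for a hyperplane in 3-D
w0 = -1.0                        # offset; w0 = 0 would mean the plane passes through the origin

def side(x):
    """Return +1 / -1 / 0 depending on which side of the hyperplane x lies."""
    return int(np.sign(w @ x + w0))

print(side(np.array([3.0, 0.0, 0.0])))   # +1 : w.x + w0 = 2
print(side(np.array([0.0, 1.0, 0.0])))   # -1 : w.x + w0 = -3
print(side(np.array([1.0, 0.0, 0.0])))   #  0 : the point lies on the hyperplane
```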
Support Vector Machine (SVM)

- What is a support vector?
- What is a hyperplane?
- Marginal distance
- Linearly separable vs. non-linearly separable data

Support Vector Machine: the input (training) data is fed to the system and a model is built from it. Once the model is built, as per the diagram, the machine's task is to predict whether a new data point fed to the system is a square or a circle.

[Diagram: circle and square data points, the separating hyperplane / decision boundary, and the margin around it.]

For better understanding, a graph is drawn with the circle and square data points. As per the graph, a line is drawn to separate the circle class and the square class; the line is called, in general, the decision boundary (or hyperplane), wᵀx + w0 = 0.

Next step: drawing a line parallel to the hyperplane that touches the nearest circle data point, and another parallel line touching the nearest square data point. The distances from the hyperplane to these touching points are called D+ and D-, and their summation D+ + D- = the margin. The circle and square data points touching these parallel lines are called the support vectors.

As per the graph, the width of the margin around the hyperplane must always be high, because when new data points keep getting added, the width must hold, and this helps to give a good prediction and accuracy.
Worked example.

Step 1: Let us draw a graph with the red (-ve) data points (0,-1), (-1,0), (1,0), (0,1) and the black (+ve) data points, and label them.

Step 2: Identify the data points which are closest to the region separating the +ve and -ve data, depending on the graph. These 3 closest points are the support vectors:

S1 = (1, 0),  S2 = (3, 1),  S3 = (3, -1)

Step 3: For the identified support vectors, each vector has to be augmented with a 1 as a bias input. So:

S1 = (1, 0)   then  S1 = (1, 0, 1)
S2 = (3, 1)   then  S2 = (3, 1, 1)
S3 = (3, -1)  then  S3 = (3, -1, 1)

Step 4: The weight of each vector has to be found, which is α1, α2, α3, as the 3 data points are S1, S2 and S3. In the equations below the right-hand side is -1 for the -ve support vector S1 and +1 for the +ve support vectors S2 and S3, as we can see:

α1·(S1·S1) + α2·(S2·S1) + α3·(S3·S1) = -1
α1·(S1·S2) + α2·(S2·S2) + α3·(S3·S2) = +1
α1·(S1·S3) + α2·(S2·S3) + α3·(S3·S3) = +1

α1(1+0+1) + α2(3+0+1) + α3(3+0+1) = -1   →   2α1 + 4α2 + 4α3 = -1
α1(3+0+1) + α2(9+1+1) + α3(9-1+1) = +1   →   4α1 + 11α2 + 9α3 = +1
α1(3+0+1) + α2(9-1+1) + α3(9+1+1) = +1   →   4α1 + 9α2 + 11α3 = +1

Solve the above equations and get the values. Multiply the first equation by 2:

4α1 + 8α2 + 8α3 = -2

Subtracting this from the second and from the third equation:

3α2 + α3 = 3      and      α2 + 3α3 = 3

Subtracting these two gives 2α2 - 2α3 = 0, so α2 = α3. Substituting back, 4α2 = 3, so α2 = α3 = 0.75. Substituting these values in the first equation: 2α1 + 4(0.75) + 4(0.75) = -1, so 2α1 = -7 and α1 = -3.5.

Once the equations are solved, let us find the weight vector from the α values and the augmented vectors:

w (augmented) = Σ αi·Si = -3.5·(1, 0, 1) + 0.75·(3, 1, 1) + 0.75·(3, -1, 1) = (1, 0, -2)

Looking at the above, our weight vector is augmented with the bias: the last entry is the hyperplane offset b, and when we write them separately the hyperplane equation y = wᵀx + b has w = (1, 0) and b = -2.

From the above, wᵀx + b = 0 gives x - 2 = 0, i.e. x = 2, and we can say the line is parallel to the y-axis: the separating hyperplane is the vertical line x = 2.
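The three linear equations above can also be checked numerically; a short sketch using the augmented support vectors from the example:

```python
import numpy as np

# Augmented support vectors (bias 1 appended); S1 is the -ve one, S2 and S3 are +ve.
S = np.array([[1, 0, 1],
              [3, 1, 1],
              [3, -1, 1]], dtype=float)
targets = np.array([-1.0, 1.0, 1.0])

# Matrix of dot products S_j . S_i, then solve for the alphas.
A = S @ S.T
alphas = np.linalg.solve(A, targets)
print(alphas)                      # [-3.5   0.75  0.75]

# Augmented weight vector: sum_i alpha_i * S_i  ->  (w, b) of the hyperplane.
w_aug = alphas @ S
print(w_aug)                       # [ 1.  0. -2.]  ->  w = (1, 0), b = -2, i.e. x = 2
```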
Normal to a Plane / Hyperplane

Q: Find an equation of the plane passing through the points P(-1, 1, 2), Q(-2, 1, 5) and R(-4, 2, 2).

Sol: Form two vectors lying in the plane:

PQ = Q - P = (-2, 1, 5) - (-1, 1, 2) = (-1, 0, 3)
PR = R - P = (-4, 2, 2) - (-1, 1, 2) = (-3, 1, 0)

The normal to the plane is their cross product:

n = PQ × PR = i(0·0 - 3·1) - j((-1)·0 - 3·(-3)) + k((-1)·1 - 0·(-3)) = -3i - 9j - k

To check whether a point lies in the plane, take any point (x, y, z) in it and form the vector from P: (x, y, z) - (-1, 1, 2) = (x+1, y-1, z-2). This vector must be perpendicular to the normal, so substitute the values into n · (x+1, y-1, z-2) = 0:

-3(x + 1) - 9(y - 1) - 1(z - 2) = 0
-3x - 3 - 9y + 9 - z + 2 = 0
-3x - 9y - z + 8 = 0

i.e. 3x + 9y + z - 8 = 0 is the equation of the plane.
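The same computation with NumPy's cross product, confirming the plane 3x + 9y + z = 8:

```python
import numpy as np

P = np.array([-1, 1, 2])
Q = np.array([-2, 1, 5])
R = np.array([-4, 2, 2])

n = np.cross(Q - P, R - P)          # normal to the plane: [-3 -9 -1]
d = n @ P                           # n . P gives the constant term: -8
print(n, d)                         # plane: -3x - 9y - z = -8  <=>  3x + 9y + z = 8

# Every one of the three points satisfies the equation:
for point in (P, Q, R):
    print(n @ point == d)           # True True True
```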
Decision Tree

- Root node: the topmost node, from which all the possible splits are made based on some conditions.
- Splitting: dividing the root node (or any node) into different parts (sub-nodes) on the basis of some condition.
- Branch / sub-tree: the tree formed by splitting a node.
- Leaf node: a node that cannot be divided further.

Ex: a medical dataset of 10 patients with the attributes Sore Throat, Fever, Swollen Glands, Congestion, Headache and the target Diagnosis (Strep throat / Cold / Allergy): 3 Strep throat, 4 Cold and 3 Allergy. So let us compute the entropy (information) of the whole dataset:

Entropy(S) = -(0.3·log2(0.3) × 2 + 0.4·log2(0.4))
           = 0.6 × [1.737] + 0.4 × [1.322]      (taking the minus sign inside, the negatives cancel)
           ≈ 1.57

Attribute: Sore Throat

            Strep  Allergy  Cold | Total
Yes           2       1       2  |   5
No            1       2       2  |   5

E(Yes) = -((2/5)·log2(2/5) + (1/5)·log2(1/5) + (2/5)·log2(2/5)) = 1.52
E(No)  = -((1/5)·log2(1/5) + (2/5)·log2(2/5) + (2/5)·log2(2/5)) = 1.52

Gain(Sore Throat) = 1.57 - (5/10 × 1.52 + 5/10 × 1.52) = 0.05

Doing the same for every attribute:

Attribute        Gain
Sore throat      0.05
Fever            0.72
Swollen glands   0.88
Congestion       0.45
Headache         0.05

The attribute which has the highest gain will be chosen as the splitting attribute when constructing the decision tree, so Swollen Glands becomes the root. The Swollen Glands = Yes branch contains a single diagnosis, so it cannot be divided further and becomes a leaf; for the Swollen Glands = No branch, the 2nd attribute which has the highest gain on that subset (Fever) is chosen, and the splitting continues until every branch ends in a node that cannot be divided, which are the leaves of the decision tree.
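A sketch of the gain calculation for the Sore Throat attribute, using only the class counts worked out above:

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

# Class counts over the 10 patients: Strep throat, Cold, Allergy.
full = [3, 4, 3]
# Counts within Sore Throat = Yes (5 patients) and Sore Throat = No (5 patients).
yes, no = [2, 2, 1], [1, 2, 2]

E_full = entropy(full)                                         # about 1.571
gain = E_full - (5/10 * entropy(yes) + 5/10 * entropy(no))
print(round(E_full, 2), round(entropy(yes), 2), round(gain, 2))   # 1.57 1.52 0.05
```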
Time Series Analysis

In this analysis you just have one index variable, i.e. TIME. The time series data is analysed in order to extract meaningful statistics and other characteristics of the data.

Ex: any dataset where we are evaluating or exploring the values in accordance with time, such as next year's sales or next month's sales, with respect to time.

Applications:
1. Stock market prediction.
2. Business forecasting, e.g. prediction of the sales of a company for the next day, or prediction of product sales for the next year.

Components of a time series:
1. Trend: the movement of the values relatively higher, or lower, or staying stationary, over a long period of time.
2. Seasonality: same as trend, but a repeating pattern within a fixed time period. Ex: the sale of cakes and chocolates during Christmas.
3. Irregularity: it happens for a short time and is non-repeating. Ex: a natural disaster; during this time only the essential items in need (tablets, etc.) will be sold, so we cannot rely on such data.
4. Cyclic: a repeating up-and-down movement with no fixed pattern, hence much harder to predict.

Note: for deterministic functions such as sin(x) and cos(x) there is no point in applying time series analysis, because a formula already exists and the values are available from it.

Stationarity in the data: the behaviour of the data over a time. When the data has a specific pattern, there is a very high probability that it will follow the same pattern in the future.
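A minimal sketch of pulling these components out of a monthly series with statsmodels; the series here is synthetic (a rising trend plus a yearly seasonal wave plus noise), just to show where trend, seasonality and irregularity show up:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

np.random.seed(0)

# Synthetic monthly sales: trend + yearly seasonality + irregular noise.
idx = pd.date_range("2018-01-01", periods=48, freq="MS")
values = (np.linspace(100, 200, 48)
          + 20 * np.sin(2 * np.pi * idx.month / 12)
          + np.random.normal(0, 5, 48))
series = pd.Series(values, index=idx)

result = seasonal_decompose(series, model="additive", period=12)
print(result.trend.dropna().head())   # the long-term movement
print(result.seasonal.head(12))       # the repeating yearly pattern
print(result.resid.dropna().head())   # what is left over: the irregular part
```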
ARIMA Model

ARIMA is a combination of two models, AR and MA:
AR - Auto Regressive
MA - Moving Average
Both are separate models; what binds them together is the integration part, I - Integration.

Explanation: when analysing the data, if there are n time points T1, T2, T3, ..., Tn and a relationship exists between a value such as T15 and the values before it, i.e. the series is regressed on its own past, that is called auto regression. If there is noise (or irregularity) in the data, the average over a window of values could be taken and used for the analysis purpose; that is the moving average part.

ARIMA:
AR - Auto Regressive
I  - Integration (order of differencing)
MA - Moving Average

The model is known by the variables (p, d, q): p for the auto-regressive part, d for the order of differencing and q for the moving-average part.
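A minimal fit of an ARIMA(p, d, q) model with statsmodels, just to show where p, d and q go; the order (1, 1, 1) and the random-walk data below are placeholders, not values from the notes:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

np.random.seed(0)

# Placeholder series: a random walk, so one order of differencing (d = 1) is a natural choice.
y = pd.Series(np.cumsum(np.random.normal(size=200)))

model = ARIMA(y, order=(1, 1, 1))   # p = AR lags, d = differencing order, q = MA lags
fit = model.fit()
print(fit.summary().tables[1])      # estimated AR and MA coefficients
print(fit.forecast(steps=5))        # forecast the next 5 time steps
```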
Ensemble Learning

Suppose you have trained four different machine learning models for an inventory-prediction task, and each one, on its own, is only a weak predictor. But weak doesn’t mean useless. You can combine them into an ensemble. For each new
prediction, you run your input data through all four models, and then compute the average of
the results. When examining the new result, you see that the aggregate results provide 96
percent accuracy, which is more than acceptable.
The reason ensemble learning is efficient is that your machine learning models work
differently. Each model might perform well on some data and less accurately on others. When
you combine all of them, they cancel out each other’s weaknesses.
You can apply ensemble methods to both prediction problems, like the inventory prediction
example we just saw, and classification problems, such as determining whether a picture
contains a certain object.
Ensemble methods
For a machine learning ensemble, you must make sure your models are independent of each
other (or as independent of each other as possible). One way to do this is to create your
ensemble from different algorithms, as in the above example.
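One way to wire up such a heterogeneous ensemble is scikit-learn's VotingClassifier; this sketch uses three different algorithms and a synthetic dataset (a regression ensemble would average predictions instead, e.g. with VotingRegressor):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Three different algorithms; "hard" voting takes the majority class (the statistical mode).
ensemble = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("tree", DecisionTreeClassifier()),
                ("knn", KNeighborsClassifier())],
    voting="hard",
)
ensemble.fit(X_train, y_train)
print(ensemble.score(X_test, y_test))
```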
Another ensemble method is to use instances of the same machine learning algorithms and train
them on different data sets. For instance, you can create an ensemble composed of 12 linear
regression models, each trained on a subset of your training data.
There are two key methods for sampling data from your training set. “Bootstrap aggregation,”
aka “bagging,” takes random samples from the training set “with replacement.” The other
method, “pasting,” draws samples “without replacement.”
To understand the difference between the sampling methods, here’s an example. Say you have
a training set with 10,000 samples and you want to train each machine learning model in your
ensemble with 9,000 samples. In case you’re using bagging, for each of your machine learning models you draw 9,000 random samples, returning each sample to the training set before the next draw, so the same example can be picked more than once.
When using pasting, you go through the same process, with the difference that samples are not
returned to the training set after being drawn. Consequently, the same sample might appear in
a model’s training set several times when using bagging but only once when using pasting.
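In scikit-learn the two sampling schemes are one flag apart; a sketch combining the two examples above (12 base models, 9,000 of 10,000 samples each), assuming decision trees as the base model:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=10_000, random_state=0)

# bootstrap=True  -> bagging  (sampling with replacement)
# bootstrap=False -> pasting  (sampling without replacement)
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=12,
                            max_samples=9_000, bootstrap=True, random_state=0)
pasting = BaggingClassifier(DecisionTreeClassifier(), n_estimators=12,
                            max_samples=9_000, bootstrap=False, random_state=0)
bagging.fit(X, y)
pasting.fit(X, y)
print(bagging.score(X, y), pasting.score(X, y))
```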
After training all your machine learning models, you’ll have to choose an aggregation method.
If you’re tackling a classification problem, the usual aggregation method is “statistical mode,”
or the class that is predicted more than others. In regression problems, ensembles usually use
the average of the predictions made by the models.
Boosting methods
AdaBoost (short for “adaptive boosting”), one of the more popular boosting methods, improves
the accuracy of ensemble models by adapting new models to the mistakes of previous ones.
After training your first machine learning model, you single out the training examples
misclassified or wrongly predicted by the model. When training the next model, you put more
emphasis on these examples. This results in a machine learning model that performs better
where the previous one failed. The process repeats itself for as many models as you want to add
to the ensemble. The final ensemble contains several machine learning models of different
accuracies, which together can provide better accuracy. In boosted ensembles, the output of
each model is given a weight that is proportionate to its accuracy.
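A sketch with scikit-learn's AdaBoostClassifier, which by default boosts shallow decision trees on a dataset of your choice (synthetic here):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each new model focuses on the examples the previous ones got wrong;
# the final prediction is a weighted vote over all of them.
boosted = AdaBoostClassifier(n_estimators=100, random_state=0)
boosted.fit(X_train, y_train)
print(boosted.score(X_test, y_test))
```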
Random forests
One area where ensemble learning is very popular is decision trees, a machine learning
algorithm that is very useful because of its flexibility and interpretability. Decision trees can
make predictions on complex problems, and they can also trace back their outputs to a series
of very clear steps.
The problem with decision trees is that they don’t create smooth boundaries between different
classes unless you break them down into too many branches, in which case they become prone
to “overfitting,” a problem that occurs when a machine learning model performs very well on
training data but poorly on novel examples from the real world.
This is a problem that can be solved through ensemble learning. Random forests are machine
learning ensembles composed of multiple decision trees (hence the name “forest”). Using
random forests ensures that a machine learning model does not get caught up in the specific
confines of a single decision tree.
Random forests have their own independent implementation in Python machine learning
libraries such as scikit-learn.
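A sketch of that scikit-learn implementation on a synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An ensemble of many decision trees, each trained on a random sample of the data
# and a random subset of the features, then aggregated together.
forest = RandomForestClassifier(n_estimators=500, random_state=0)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))
```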
While ensemble learning is a very powerful tool, it also has some tradeoffs.
Using ensemble means you must spend more time and resources on training your machine
learning models. For instance, a random forest with 500 trees provides much better results than
a single decision tree, but it also takes much more time to train. Running ensemble models can
also become problematic if the algorithms you use require a lot of memory.
Another problem with ensemble learning is explainability. While adding new models to an
ensemble can improve its overall accuracy, it makes it harder to investigate the decisions made
by the AI algorithm. A single machine learning model such as a decision tree is easy to trace,
but when you have hundreds of models contributing to an output, it is much more difficult to
make sense of the logic behind each decision.