Tutorial - Bayesian Learning


© Ben Galili IDC

¡ After grades are published, you have one week to send an email with your appeal (for HW1, the one-week window starts now)
¡ You will get a response email with the appeal decision
¡ The grade will be updated only in the Excel file (uploaded to the first section in Moodle)

© Ben Galili IDC


¡ A family of learning algorithms that:
§ Do not build a model of the data (like the tree in Decision Tree)
§ Instead, compare each new instance with the instances seen during training
¡ Time complexity:
§ Fast learning (no learning…)
§ Potentially slow classification/prediction (O(n))
¡ Space complexity:
§ Store all training instances (O(n))
¡ Used in both Classification and Regression

© Ben Galili IDC


¡ How to find the nearest? √
§ We know the possible methods & we use X-Fold Cross Validation to choose the best one
¡ Slow query & large space √
§ We are now able to reduce space (remove irrelevant points) & accelerate query time (K-D tree, reduced calculation time)
¡ How to choose k? √
§ We use X-Fold Cross Validation to choose the best one
© Ben Galili IDC
¡ This assignment has 3 phases:
§ First
▪ Implement a feature scaler
§ Second
▪ Implement the kNN algorithm
▪ Use cross validation in order to find the best hyper-parameters (k, p for the distance method, weighted / uniform majority)
§ Third
▪ Examine the influence of the number of folds on the running time of each fold and the total running time
▪ Implement an efficient-distance kNN and see how it affects the running time
© Ben Galili IDC
¡ Feature Scaling
§ 1 class: FeatureScaler
§ Should receive an instances object and return a scaled instances object
§ We'll use standardization for scaling in this assignment:
$x' = \frac{x - \mu}{\sigma}$
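A minimal sketch of standardization (assuming plain NumPy arrays rather than the Weka-style instances object used in the assignment):

    import numpy as np

    def standardize(X):
        # Column-wise standardization: z = (x - mean) / std
        mean = X.mean(axis=0)
        std = X.std(axis=0)
        std[std == 0] = 1.0              # leave constant features unchanged
        return (X - mean) / std

    X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
    print(standardize(X))                # each column now has mean 0 and std 1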

© Ben Galili IDC


¡ Implement kNN
§ 2 classes: MainHW3 & kNN
§ The kNN class is the algorithm object
▪ You need to think about which properties the class needs (hint: think about which parameters the kNN algorithm needs)
§ MainHW3 should find the best combination of k, p (the distance method) and the voting method
– It should go over all combinations and select the one with the smallest error using cross validation (see the sketch below)
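A minimal Python sketch of this hyper-parameter search (a stand-in for the MainHW3 logic; it assumes NumPy arrays X, y rather than the course's instances objects, and the value grids are illustrative):

    import numpy as np
    from itertools import product

    def lp_distance(a, b, p):
        # Minkowski (l_p) distance between two feature vectors
        return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

    def knn_predict(X_train, y_train, x, k, p, weighted):
        dists = np.array([lp_distance(x, xt, p) for xt in X_train])
        idx = np.argsort(dists)[:k]
        weights = 1.0 / (dists[idx] ** 2 + 1e-12) if weighted else np.ones(len(idx))
        votes = {}
        for label, w in zip(y_train[idx], weights):
            votes[label] = votes.get(label, 0.0) + w
        return max(votes, key=votes.get)

    def cv_error(X, y, k, p, weighted, folds=10):
        # Average validation error over the folds
        order = np.random.RandomState(0).permutation(len(X))
        errors = []
        for split in np.array_split(order, folds):
            train = np.setdiff1d(order, split)
            preds = np.array([knn_predict(X[train], y[train], x, k, p, weighted)
                              for x in X[split]])
            errors.append(np.mean(preds != y[split]))
        return np.mean(errors)

    def find_best_params(X, y, ks=(1, 3, 5, 7), ps=(1, 2, 3), weightings=(False, True)):
        # Exhaustive search over all hyper-parameter combinations
        best, best_err = None, float('inf')
        for k, p, w in product(ks, ps, weightings):
            err = cv_error(X, y, k, p, w)
            if err < best_err:
                best, best_err = (k, p, w), err
        return best, best_err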

© Ben Galili IDC


¡ Efficient kNN: Efficient Distance Check
§ After you have found the kNN parameters, implement the efficient kNN
§ You need to implement an efficient distance check for the $l_p$ distance:

$d(x^a, x^b) = \left( \sum_{f=1}^{\#features} \left| x^a_f - x^b_f \right|^p \right)^{1/p}$

§ Remember – the goal is to stop iterating once we are above a desired threshold
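A sketch of one way to implement this (assuming the threshold passed in is the current k-th smallest distance raised to the power p, so the p-th root can be skipped during the scan):

    def lp_distance_with_cutoff(a, b, p, threshold):
        # Accumulate |a_f - b_f|^p term by term; once the partial sum exceeds the
        # threshold, this point can no longer be one of the k nearest neighbors,
        # so stop early instead of computing the exact distance.
        total = 0.0
        for af, bf in zip(a, b):
            total += abs(af - bf) ** p
            if total > threshold:
                return float('inf')
        return total ** (1.0 / p)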

© Ben Galili IDC


¡ Our previous models didn't use probability calculations (except maybe in the goodness of split)
¡ The most intuitive algorithm is to return the majority class, or in other words – return the most probable class according to the training set
¡ Today's agenda – probabilistic algorithms – algorithms that use probability techniques in order to predict new instances
© Ben Galili IDC
¡ Sample space
¡ A sample space is a set of events which lists
all possible outcomes:
§ For a coin toss this is the sample space: S = {H,T}
§ For rolling a die this is the sample space:
S = {1,2,3,4,5,6}
§ For rolling two dice this is the sample space:
S = {(1,1), (1,2), (1,3), (1,4), …, (6,5), (6, 6)}

© Ben Galili IDC


¡ Events
¡ Any subset of the sample space is called an
event
§ For rolling two dice
E={(1,6) , (2,5), (3,4), (4,3), (5,2), (6,1)}

© Ben Galili IDC


¡ Events
¡ Some basic operations on events:
§ Union
§ Intersection
§ Complement

© Ben Galili IDC


¡ Random variable
¡ Some function of the outcome event
¡ For example, the sum of two dice (not the two numbers that come up):
§ Let X be a random variable denoting the sum of two dice rolls:
▪ $P(X=1) = 0$
▪ $P(X=2) = P(\{(1,1)\}) = \frac{1}{36}$
▪ $P(X=4) = P(\{(1,3), (3,1), (2,2)\}) = \frac{3}{36}$

© Ben Galili IDC


¡ Random variable
¡ We now can define the expected value of a random
variable:
§ For a discrete variable:
$E[X] = \sum_{x} x \, p(x)$
* Where p is the probability mass function (pmf)
§ For a continuous variable:
$E[X] = \int_{-\infty}^{\infty} x \, f(x) \, dx$
* Where f is the probability density function (pdf)
© Ben Galili IDC
¡ Random variable
¡ The variance:
$\sigma^2 = Var(X) = E\left[ (X - \mu)^2 \right]$
¡ The standard deviation (= square root of the variance):
$\sigma = \sqrt{Var(X)} = \sqrt{E\left[ (X - \mu)^2 \right]}$

© Ben Galili IDC


¡ $P(A \cup B) = ?$
$P(A) + P(B) - P(A \cap B)$
¡ $P(A \cup B \cup C) = ?$
$P(A \cup B \cup C) =$
$P(A \cup B) + P(C) - P((A \cup B) \cap C) =$
$P(A) + P(B) - P(A \cap B) + P(C) - P((A \cap C) \cup (B \cap C)) =$
$P(A) + P(B) - P(A \cap B) + P(C) - P(A \cap C) - P(B \cap C) + P(A \cap B \cap C) =$
$P(A) + P(B) + P(C) - P(A \cap B) - P(A \cap C) - P(B \cap C) + P(A \cap B \cap C)$
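A quick sanity check of the three-event formula on the two-dice sample space from the earlier slide (the events A, B, C below are an arbitrary illustrative choice):

    from fractions import Fraction
    from itertools import product

    S = set(product(range(1, 7), repeat=2))   # sample space of two dice rolls
    A = {s for s in S if s[0] == 1}           # first die shows 1
    B = {s for s in S if s[1] == 1}           # second die shows 1
    C = {s for s in S if sum(s) == 7}         # the sum is 7
    P = lambda E: Fraction(len(E), len(S))

    lhs = P(A | B | C)
    rhs = (P(A) + P(B) + P(C)
           - P(A & B) - P(A & C) - P(B & C)
           + P(A & B & C))
    print(lhs, rhs, lhs == rhs)               # both 5/12 -> True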

© Ben Galili IDC


¡ As we said before, the simplest way is to ask which class has the higher probability in the training set
¡ What is the probability that you'll pass the exam?
§ We have training data of the previous year's results
§ There are 2 classes: 'Pass' or 'Fail'
§ 'Pass' probability is 90%, and 'Fail' is 10%
§ So everyone here has a 90% chance to pass the exam
§ This uses the prior probability, which in our context is the class distribution in the training set

© Ben Galili IDC


¡ Conditional probability:
§ $P(A|B) = \frac{P(A \cap B)}{P(B)}$
§ $P(A|B_1) = ?$
▪ $\frac{P(A \cap B_1)}{P(B_1)} = 1$
§ $P(A|B_2) = ?$
▪ $\frac{P(A \cap B_2)}{P(B_2)} = 0.75$
§ $P(A|B_3) = ?$
▪ $\frac{P(A \cap B_3)}{P(B_3)} = 0$

© Ben Galili IDC


¡ Example:
§ $P(Pass) = 90\%$
§ $P(Fail) = 10\%$
¡ We also know that:
§ $P(\text{Learn for the test} \mid Pass) = 90\%$
§ $P(\text{Didn't learn} \mid Pass) = 10\%$
§ $P(\text{Learn for the test} \mid Fail) = 5\%$
§ $P(\text{Didn't learn} \mid Fail) = 95\%$
¡ What is the probability that you pass the test if you learn?
© Ben Galili IDC
§ $P(Pass \cap \text{Learn for the test}) = P(Pass) \times P(\text{Learn for the test} \mid Pass) = 90\% \times 90\% = 81\%$
§ $P(Pass \cap \text{Didn't learn}) = P(Pass) \times P(\text{Didn't learn} \mid Pass) = 90\% \times 10\% = 9\%$
§ $P(Fail \cap \text{Learn for the test}) = P(Fail) \times P(\text{Learn for the test} \mid Fail) = 10\% \times 5\% = 0.5\%$
§ $P(Fail \cap \text{Didn't learn}) = P(Fail) \times P(\text{Didn't learn} \mid Fail) = 10\% \times 95\% = 9.5\%$
§ $P(\text{Learn for the test}) = P(Pass \cap \text{Learn for the test}) + P(Fail \cap \text{Learn for the test}) = 81\% + 0.5\% = 81.5\%$
§ $P(\text{Didn't learn}) = P(Pass \cap \text{Didn't learn}) + P(Fail \cap \text{Didn't learn}) = 9\% + 9.5\% = 18.5\%$

© Ben Galili IDC


§ $P(Pass \mid \text{Learn for the test}) = \frac{P(Pass \cap \text{Learn for the test})}{P(\text{Learn for the test})} = \frac{81\%}{81.5\%} \approx 99\%$
§ $P(Fail \mid \text{Learn for the test}) = \frac{P(Fail \cap \text{Learn for the test})}{P(\text{Learn for the test})} = \frac{0.5\%}{81.5\%} \approx 1\%$
§ $P(Pass \mid \text{Didn't learn}) = \frac{P(Pass \cap \text{Didn't learn})}{P(\text{Didn't learn})} = \frac{9\%}{18.5\%} \approx 49\%$
§ $P(Fail \mid \text{Didn't learn}) = \frac{P(Fail \cap \text{Didn't learn})}{P(\text{Didn't learn})} = \frac{9.5\%}{18.5\%} \approx 51\%$
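A small Python sketch that reproduces these numbers with Bayes' rule (the class and evidence names are shortened):

    # Priors and class-conditional likelihoods from the slide
    prior = {'Pass': 0.90, 'Fail': 0.10}
    likelihood = {                       # P(evidence | class)
        ('Learn', 'Pass'): 0.90, ('NoLearn', 'Pass'): 0.10,
        ('Learn', 'Fail'): 0.05, ('NoLearn', 'Fail'): 0.95,
    }

    def posterior(evidence):
        joint = {c: prior[c] * likelihood[(evidence, c)] for c in prior}
        total = sum(joint.values())      # P(evidence), the denominator
        return {c: joint[c] / total for c in joint}

    print(posterior('Learn'))            # {'Pass': ~0.99, 'Fail': ~0.01}
    print(posterior('NoLearn'))          # {'Pass': ~0.49, 'Fail': ~0.51}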

© Ben Galili IDC


¡ Independent events
§ If $P(A \cap B) = P(A)P(B)$ then A & B are independent
§ From conditional probability we get:
$P(A|B) = \frac{P(A \cap B)}{P(B)}$
$P(A \cap B) = P(A|B)\,P(B)$
§ If A & B are independent:
$P(A)P(B) = P(A \cap B) = P(A|B)\,P(B)$
$P(A) = P(A|B)$
* And also $P(B) = P(B|A)$
© Ben Galili IDC
¡ The likelihood is the class conditional
information – the probability of an instance,
given the class
§ For an instance x, and 2 possible classes A, B:
[Figure: the class-conditional densities P(x|A) and P(x|B)]
If x = 12, we'll predict B, because P(x|B) > P(x|A)

© Ben Galili IDC


¡ If we return to the previous example (fail \ pass), it is like asking:
§ What is the probability that someone learned for the test, given that they passed the exam?
¡ But we wanted to know the probability of passing \ failing the exam given that you learn
¡ So we need a way to go from the likelihood to the posterior probability
© Ben Galili IDC
¡ Bayes' rule:
$P(A|x) = \frac{P(x|A)\,P(A)}{P(x)}$
¡ With this rule we can convert the likelihood to the posterior probability, if we also have the prior probability
¡ A classifier that classifies A if P(A|x) > P(B|x) is a classifier that maximizes the posterior probability – MAP
¡ The classification with MAP depends on both the likelihood and the prior probabilities

© Ben Galili IDC


¡ So if we want to classify according to MAP:
§ We will classify A if
$P(A|x) = \frac{P(x|A)\,P(A)}{P(x)} > \frac{P(x|B)\,P(B)}{P(x)} = P(B|x)$
$P(x|A)\,P(A) > P(x|B)\,P(B)$
§ Note that P(x) is removed from both sides' denominators simply because it is the same

© Ben Galili IDC


¡ This classification rule minimizes the error:
§ If we classify B, then $P(error|x) = P(A|x)$
§ If we classify A, then $P(error|x) = P(B|x)$
¡ But we classify B only if $P(B|x) > P(A|x)$, and therefore the probability of the error is minimal:
$P(error|x) = \min\left[ P(A|x), P(B|x) \right]$

© Ben Galili IDC


¡ We can define a loss measure for a wrong decision:
§ 0-1 loss (the simplest one):
$\lambda_{ij} = \lambda(\text{choose } \omega_i \mid \omega_j) = \begin{cases} 1, & \text{if } i \neq j \\ 0, & \text{if } i = j \end{cases}$

© Ben Galili IDC


¡ After we have defined the loss we can define the risk, which is the expected loss (for k classes):
$R(\text{choose } \omega_i \mid x) = \sum_{j=1}^{k} \lambda_{ij} \, P(\omega_j|x) = \sum_{j \neq i} P(\omega_j|x) = 1 - P(\omega_i|x)$
¡ A classifier that wants to minimize the risk will choose $\omega_i$ such that:
$P(\omega_i|x) > P(\omega_j|x) \quad \forall j \neq i$

© Ben Galili IDC


¡ We can use Bayes' rule even for a multi-class problem:
$g_i(x) = P(\omega_i|x) = \frac{P(x|\omega_i)\,P(\omega_i)}{\sum_{j=1}^{k} P(x|\omega_j)\,P(\omega_j)}$
¡ The denominator is the same for all $g_i(x)$, so it can be dropped:
$g_i(x) = P(x|\omega_i)\,P(\omega_i)$

© Ben Galili IDC


¡ In order to make the classification process more efficient we can use ln():
$g_i(x) = \ln\left( P(x|\omega_i)\,P(\omega_i) \right) = \ln P(x|\omega_i) + \ln P(\omega_i)$
¡ It helps avoid multiplying small numbers (between 0 and 1) and deals better with the normal distribution's $e^{f(x)}$
¡ We can do this because ln() is monotonically increasing
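A quick illustration of why working in log-space helps (the numbers are purely illustrative):

    import math

    p = [1e-5] * 200                        # many small likelihood terms
    prod = 1.0
    for v in p:
        prod *= v
    print(prod)                             # 0.0 -- the product underflows

    log_sum = sum(math.log(v) for v in p)   # stays a finite, comparable score
    print(log_sum)                          # about -2302.6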
© Ben Galili IDC
¡ If we’re going back to the regression task, we can
define the hypothesis to be any function ℎ " : $ → &
that belongs to the hypothesis space ℎ ∈ (
¡ We want to find the most probable hypothesis
¡ This is a conditional probability problem – find the
hypothesis that maximize
) ℎ* =) *ℎ ) ℎ
* posterior probability
¡ We will assume all ℎ ∈ ( have the same prior
probability, and we’ll get that the most probable ℎ will
be found according to maximum likelihood:
ℎ,- = argmax 5 * ℎ
3∈4
© Ben Galili IDC
¡ Assuming the instances are independent:
$P(D|h) = \prod_i P(d_i|h)$
¡ If the error has a normal distribution, $e_i \sim N(0, \sigma)$, then we can say that the probability that $h(x_i) = d_i$ is the same as the probability that $e_i = 0$, according to the normal distribution of $e_i$
© Ben Galili IDC
¡ And we get:
$h_{ML} = \arg\max_{h \in H} P(D|h) = \arg\max_{h \in H} \prod_i P(d_i|h)$
$= \arg\max_{h \in H} \prod_i \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{1}{2}\left(\frac{d_i - \mu_i}{\sigma}\right)^2}$
$= \arg\max_{h \in H} \prod_i \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{1}{2}\left(\frac{d_i - h(x_i)}{\sigma}\right)^2}$
© Ben Galili IDC
$h_{ML} = \arg\max_{h \in H} \prod_i \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{1}{2}\left(\frac{h(x_i) - d_i}{\sigma}\right)^2}$

$h_{ML} = \arg\max_{h \in H} \ln \prod_i \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{1}{2}\left(\frac{h(x_i) - d_i}{\sigma}\right)^2}$

$h_{ML} = \arg\max_{h \in H} \sum_i \left[ \ln \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2}\left(\frac{h(x_i) - d_i}{\sigma}\right)^2 \right]$

$= \arg\max_{h \in H} \sum_i -\frac{1}{2}\left(\frac{h(x_i) - d_i}{\sigma}\right)^2$

$= \arg\max_{h \in H} \sum_i -\left( h(x_i) - d_i \right)^2$

$= \arg\min_{h \in H} \sum_i \left( h(x_i) - d_i \right)^2$
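A tiny numeric check of this conclusion on simulated data: for a constant hypothesis h(x) = c with Gaussian noise, the sum-of-squared-errors minimizer coincides with the sample mean, which is the ML estimate (the data below is synthetic, not from the course):

    import numpy as np

    rng = np.random.default_rng(0)
    d = 3.0 + rng.normal(0.0, 0.5, size=1000)   # noisy observations of the value 3.0

    c_grid = np.linspace(2.0, 4.0, 401)
    sse = np.array([np.sum((d - c) ** 2) for c in c_grid])
    print(c_grid[np.argmin(sse)], d.mean())     # both close to 3.0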

© Ben Galili IDC


¡ Prior classifier: $P(A) > P(B)$
¡ ML classifier: $P(x|A) > P(x|B)$ – assuming $P(A) = P(B)$
¡ MAP classifier:
$P(A|x) = P(x|A)\,P(A) > P(x|B)\,P(B) = P(B|x)$
* Dropping $P(x)$ from the denominator

© Ben Galili IDC


¡ Parametric models
§ If we know / can guess the distribution type, we can estimate the parameters of the distribution
¡ Non-parametric models
§ Histogram (= count…)
§ Naïve Bayes

© Ben Galili IDC


¡ For each class we will estimate the distribution parameters according to the training dataset
¡ If we're talking about normal distribution parameters, we need to estimate the mean and the variance:
$\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$
$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2$
© Ben Galili IDC
¡ Now, we can estimate the parameters of each likelihood probability, for each class:
$\mu_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$
$\sigma_i^2 = \frac{1}{|C_i|} \sum_{x \in C_i} (x - \mu_i)^2$
¡ And then classify according to the largest probability given by the normal distribution formula:
$P(x|C_i) = \frac{1}{\sqrt{2\pi\sigma_i^2}} \, e^{-\frac{(x - \mu_i)^2}{2\sigma_i^2}}$
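A minimal sketch of such a one-attribute Gaussian MAP classifier (assuming plain NumPy arrays and working in log-space, as suggested earlier; this is illustrative, not a required interface):

    import numpy as np

    def fit_gaussian_per_class(X, y):
        # Per-class prior, mean and variance for a single numeric attribute
        params = {}
        for c in np.unique(y):
            xc = X[y == c]
            params[c] = (len(xc) / len(X), xc.mean(), xc.var())
        return params

    def classify(params, x):
        # MAP in log-space: ln P(C_i) + ln P(x | C_i) using the Gaussian density above
        def score(c):
            prior, mu, var = params[c]
            return (np.log(prior)
                    - 0.5 * np.log(2 * np.pi * var)
                    - (x - mu) ** 2 / (2 * var))
        return max(params, key=score)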
© Ben Galili IDC
¡ But this was good only for 1 attribute
¡ What if we have more than 1?
¡ In this case each likelihood probability will be estimated according to a multivariate normal distribution
¡ For this we will need a mean vector (each dimension will be the mean of one attribute) and the covariance matrix


$\Sigma = \begin{bmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1d} \\ \sigma_{21} & \sigma_{22} & \cdots & \vdots \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{d1} & \sigma_{d2} & \cdots & \sigma_{dd} \end{bmatrix} = \begin{bmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1d} \\ \sigma_{21} & \sigma_2^2 & \cdots & \vdots \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{d1} & \sigma_{d2} & \cdots & \sigma_d^2 \end{bmatrix}$

(the diagonal entries are the variances of the individual attributes)

$|\Sigma|$ – the determinant of the covariance matrix
$\Sigma^{-1}$ – the inverse of the covariance matrix
© Ben Galili IDC
¡ For each attribute we will find the mean and the variance as before, and we will create the mean vector and the covariance matrix
¡ We will classify according to the multivariate normal distribution:
$P(\bar{x}|C_i) = \frac{1}{(2\pi)^{d/2} \, |\Sigma|^{1/2}} \, e^{-\frac{1}{2}(\bar{x} - \bar{\mu})^T \Sigma^{-1} (\bar{x} - \bar{\mu})}$
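A sketch of the pieces needed for the multivariate case (assuming NumPy arrays; note that np.cov uses the unbiased 1/(n-1) estimate, while the slides divide by n):

    import numpy as np

    def fit_class(Xc):
        # Mean vector and covariance matrix of one class's samples
        return Xc.mean(axis=0), np.cov(Xc, rowvar=False)

    def mvn_log_density(x, mu, cov):
        # Log of the multivariate normal density from the slide
        d = len(mu)
        diff = x - mu
        return (-0.5 * d * np.log(2 * np.pi)
                - 0.5 * np.log(np.linalg.det(cov))
                - 0.5 * diff @ np.linalg.inv(cov) @ diff)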

© Ben Galili IDC


¡ What if we don't know the type of the distribution?
¡ We need another way to estimate the probabilities $P(x|\omega_i)$ and $P(\omega_i)$
¡ The prior probability $P(\omega_i)$ can be estimated from the class frequencies in the training set
¡ But what about the likelihood?
© Ben Galili IDC


¡ In order to estimate the likelihood for a given instance we need a huge dataset
¡ If we have d attributes, the number of possible terms in the likelihood $P(x_1, x_2, \ldots, x_d | \omega_i)$ is
$k \cdot |V_1| \cdot |V_2| \cdots |V_d|$
(where k is the number of classes and $|V_j|$ is the number of possible values of attribute j)
¡ We need a way / an assumption to overcome this problem

© Ben Galili IDC


¡ If we assume that all attributes are independent given the class, we will get:
$P(x_1, x_2, \ldots, x_d | \omega_i) = \prod_{j=1}^{d} P(x_j|\omega_i)$
¡ And now we can find the MAP:
$\omega_{MAP} = \arg\max_{i} P(\omega_i) \prod_{j=1}^{d} P(x_j|\omega_i)$
¡ With this assumption we lower the necessary size of the dataset to
$k \cdot \sum_{j=1}^{d} |V_j|$
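A self-contained Naïve Bayes sketch over discrete attributes (the add-alpha smoothing and the toy data are additions for illustration, not part of the slides):

    import math
    from collections import Counter, defaultdict

    def fit_naive_bayes(X, y, alpha=1.0):
        # X: list of tuples of discrete attribute values, y: list of class labels.
        # Estimate P(class) and P(value | class, attribute) by counting; add-alpha
        # smoothing avoids zero probabilities for unseen values.
        classes = Counter(y)
        counts = defaultdict(Counter)                 # (class, attribute) -> value counts
        for xi, yi in zip(X, y):
            for j, v in enumerate(xi):
                counts[(yi, j)][v] += 1
        n_values = [len({xi[j] for xi in X}) for j in range(len(X[0]))]

        def predict(x):
            def score(c):
                s = math.log(classes[c] / len(y))     # ln P(class)
                for j, v in enumerate(x):             # + sum_j ln P(x_j | class)
                    s += math.log((counts[(c, j)][v] + alpha) /
                                  (classes[c] + alpha * n_values[j]))
                return s
            return max(classes, key=score)
        return predict

    # Toy usage: predict 'play' from (outlook, wind)
    X = [('sunny', 'weak'), ('sunny', 'strong'), ('rain', 'weak'), ('rain', 'strong')]
    y = ['yes', 'no', 'yes', 'no']
    predict = fit_naive_bayes(X, y)
    print(predict(('sunny', 'weak')))                 # 'yes'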

© Ben Galili IDC
