
ML - Unit 1

Uploaded by

pullurishikhar
Copyright
© © All Rights Reserved

Machine learning

What is Machine Learning?


Machine learning is a branch of artificial intelligence.
Here, we feed data into algorithms; in turn, they identify patterns in the data, which are then used to make predictions.
The process, which relies on algorithms and statistical models to identify patterns in data, doesn't require explicit programming for each task.
It is then further optimized through trial and error and feedback, meaning machines learn by experience and increased exposure to data, much the same way humans do.

Data science life cycle

AI - ML &DS - ANN - DL - GenAI Architecture

What is Generative AI?

Generative AI is a branch of artificial intelligence with the power to bring new creations to life. Instead of solely analyzing and understanding data, GenAI is all about generating fresh content and information. It learns from existing data, identifies patterns and structures, and then uses this knowledge to craft something entirely new.

What sets Generative AI apart is its ability to go beyond interpretation and create unique outputs. Whether it's writing a story, composing a melody, or designing an image, GenAI goes beyond imitation.

Here are some key aspects of generative AI:

Applications:
Used in a wide range of industries.
Banking: identifying fraud; processing loans.
Health care: assessing the possibility of a person running into health issues.
Smart devices quickly process human conversations through

Machine learning algorithms consist of three parts:

1) a decision process that makes classifications based on input data,
2) an error function to evaluate predictions and adjust for accuracy, and
3) a model optimization process that adjusts the weights given to various factors in order to reduce discrepancies between the model's estimate and the example.
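The three parts can be sketched in a few lines. This is only an illustration under stated assumptions: a one-feature linear model predicting a number rather than a class, a mean-squared-error function, and plain gradient descent. The data, the learning rate and all function names are hypothetical.

```python
# Minimal sketch of the three parts, assuming a one-feature model y = w * x.
# The data, learning rate and function names are hypothetical.

def predict(w, x):
    """Decision process: turn an input into an estimate."""
    return w * x

def mean_squared_error(w, xs, ys):
    """Error function: average squared gap between estimates and examples."""
    return sum((predict(w, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def optimize(xs, ys, learning_rate=0.05, steps=200):
    """Optimization process: adjust the weight step by step to reduce the error."""
    w = 0.0
    for _ in range(steps):
        gradient = sum(2 * (predict(w, x) - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= learning_rate * gradient
    return w

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # examples generated by y = 2x
w = optimize(xs, ys)
print(round(w, 3), round(mean_squared_error(w, xs, ys), 6))
```

The learned weight ends up close to 2, the value that generated the examples.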
4 Types of Machine Learning:
Supervised, unsupervised, semi-supervised and reinforcement learning

1. SUPERVISED LEARNING

Supervised learning involves training a machine and its algorithm using labeled training data. This requires a significant amount of human guidance. It's one of the most popular forms of machine learning and is able to train models to accomplish tasks in classification, regression or forecasting. Supervised learning is commonly used to create recommender systems, detect inbox spam and predict stock and housing market values.

Data must be divided into features (the input data) and labels (the output data).

Features describe individual, measurable units of data, such as height, salary, colors or animal breeds.

Supervised learning serves as an umbrella for specific algorithms and statistical methods. Here are a few that fall under supervised learning.

Classification

Used to further categorize data, classification algorithms are a great tool to sort, and even hide, that data. (Ex: in a large email client, you may notice that some emails are automatically redirected to a spam or promotions folder, essentially hiding those emails from view.)

A few popular classification algorithms used to sort data include k-nearest neighbors (KNN), naive Bayes classifier algorithms, support vector machine (SVM) algorithms, decision trees and random forest models.
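As an illustration of one of these algorithms, here is a minimal k-nearest neighbors sketch. The two features (link count and exclamation-mark count) and the tiny training set are hypothetical, chosen to echo the spam-folder example.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical features: (number of links, number of exclamation marks)
train = [
    ((0, 0), "inbox"), ((1, 0), "inbox"), ((0, 1), "inbox"),
    ((7, 5), "spam"), ((8, 6), "spam"), ((6, 7), "spam"),
]

print(knn_predict(train, (1, 1)))
print(knn_predict(train, (7, 6)))
```

A new email is assigned whichever label is most common among its three closest labeled neighbors.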

Regression

Regression algorithms are frequently used tools for forecasting trends. These algorithms identify relationships between outcomes and other independent variables to make accurate predictions. Linear regression algorithms are the most widely used, but other commonly used regression algorithms include logistic regression.

Use case: fingerprint analysis

1. Regression model (x: independent variable, y: dependent variable): y = f(x). The output y is a continuous variable.
2. Classification model: the output is a categorical data type.
Dependent variable vs. independent variable
Ex: we are trying to classify the data as pass / detained.

2. UNSUPERVISED LEARNING

No class labels.
Input data = X (independent variable)
Output data = Y (dependent variable)
Y = f(X)
We determine how the dependent variable changes with the independent variables.

With unsupervised learning, raw data with no labels is processed by the system, meaning less work for humans. Unsupervised learning algorithms discover patterns or anomalies in large, unstructured data sets that may otherwise go undetected by humans. This makes it applicable for accomplishing tasks related to clustering or dimensionality reduction.

Unsupervised learning algorithms work by analyzing available data and grouping information based on similarities and differences, thus creating relationships between data points.
Ex: Customer and audience segmentation,
Computer vision and
Fraud / breach detection

Clustering and dimensionality reduction are among the most common types of unsupervised learning methods.

Even though there are no class labels, we are able to divide the data into clusters consisting of similar items.

Clustering

These algorithms focus on similarities within raw data and then group that information accordingly. More simply, these algorithms provide structure to raw data. Clustering algorithms are often used with marketing data to garner customer (or potential customer) insights, as well as for fraud detection.

Examples of clustering algorithms include 1) hierarchical clustering and 2) k-means clustering.
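A minimal k-means sketch follows, assuming one-dimensional data and fixed starting centroids so the run is repeatable. The spending figures and the starting centroids are hypothetical.

```python
def k_means(points, centroids, iterations=10):
    """Plain k-means on 1-D points, starting from the given centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its old centroid).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical monthly spend of six customers
spend = [10, 12, 11, 50, 52, 49]
centroids, clusters = k_means(spend, centroids=[10.0, 50.0])
print(centroids)
print(clusters)
```

No labels are supplied; the two customer groups emerge only from how close the values are to each other.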

Dimensionality Reduction

Dimensionality reduction is the process of reducing the number of features within a data set while preserving important properties of the data. This is done to reduce processing time, storage space, complexity and overfitting in a machine learning model.

The 2 main methods for applying dimensionality reduction:

1) Feature selection: Feature selection involves selecting a subset of relevant features from the original feature set to use as input into a model. This helps simplify the model and improve the accuracy of outputs.

2) Feature extraction: Feature extraction involves deriving new, significant features from the original raw data for input, which focuses on cutting through redundant data.

Popular dimensionality reduction algorithms include principal component analysis (PCA), non-negative matrix factorization (NMF), linear discriminant analysis (LDA) and generalized discriminant analysis (GDA).

3. SEMI-SUPERVISED LEARNING

Semi-supervised learning offers a balanced mix of both supervised and unsupervised learning. With semi-supervised learning, a hybrid approach is taken as small amounts of labeled data are processed alongside larger chunks of raw data. This strategy essentially gives algorithms a head start when it comes to identifying relevant patterns and making accurate predictions when compared with unsupervised learning algorithms, without the time, effort and cost associated with more labor-intensive supervised learning algorithms.

Ex: 1) fraud detection, 2) speech recognition, 3) text document classification.

Because semi-supervised learning uses labeled data and unlabeled data, it often relies on modified supervised and unsupervised algorithms trained for both data types.
Self-Training

Self-training algorithms use a pre-existing, supervised classifier model, known as a pseudo-labeler, that's trained on a small portion of labeled data in a set. The pseudo-labeler is then used to make predictions on the remainder of the dataset, which is unlabeled. Labels produced from this process are called pseudo-labels, and are added back into the labeled dataset. These actions are done repeatedly by the model until all data samples are labeled or there are no more to label, improving its accuracy over time.

Label Propagation

Label propagation algorithms assign labels to unlabeled observations by propagating, or allocating, labels through a dataset over time, usually across a graph. These datasets tend to start with a small section already having labels, and assign labels based on direct connections between these data points in the graph. Label propagation can be used to quickly identify communities, detect abnormal behavior or accelerate marketing campaigns. For example, if one customer on a graph likes a certain product, a customer branched directly off of them may also like it.

4.REINFORCEMENT LEARNING

With reinforcement learning, AI-powered software programs, commonly referred to as intelligent agents, are outfitted with sensors and respond to their surrounding environment, making decisions independently to achieve a desired outcome. (Think simulations, computer games and the real world.)

Intelligent agents are self-trained by being rewarded for desired behaviors or punished for undesired behaviors. By perceiving and interacting with their environment, these agents learn through trial and error, ultimately reaching optimal proficiency through positive reinforcement during the learning process. Reinforcement learning is often used in robotics and self-driving cars, helping machines acquire specific skills and behaviors.

These are some of the algorithms that fall under reinforcement learning.

Q-Learning

Q-learning is a reinforcement learning algorithm that does not require a model of the intelligent agent's environment. Q-learning algorithms iteratively calculate the value of actions based on rewards resulting from those actions, which improves outcomes and behaviors over time.

[Q-learning (Watkins, 1989) is a method for optimizing cumulative discounted reward, making far-future rewards less prioritized than near-term rewards. R-learning (Schwartz, 1993) is a method for optimizing average reward, weighing both far-future and near-term reward the same.]
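The iterative value calculation can be sketched as a single tabular update step. This assumes the standard Q-learning rule with a step size alpha and a discount factor gamma; the two-state world, the state and action names and the numbers are hypothetical.

```python
def q_update(q, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """One tabular Q-learning step:
    Q(s, a) <- Q(s, a) + alpha * (reward + gamma * max_a' Q(s', a') - Q(s, a))
    """
    best_next = max(q[next_state].values())
    q[state][action] += alpha * (reward + gamma * best_next - q[state][action])
    return q[state][action]

# Hypothetical two-state world; the table starts with one known value.
q = {
    "start": {"left": 0.0, "right": 0.0},
    "goal": {"left": 0.0, "right": 2.0},
}

first = q_update(q, "start", "right", reward=1.0, next_state="goal")
second = q_update(q, "start", "right", reward=1.0, next_state="goal")
print(first, second)
```

The discount factor gamma (below 1) is what makes far-future rewards count for less than near-term ones. No model of the environment appears anywhere; only observed rewards and the table are used.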

Deep Reinforcement Learning

Used in the development of self-driving cars, video games and robots, deep reinforcement learning combines deep learning (machine learning based on artificial neural networks) with reinforcement learning, where actions, or responses to the artificial neural network's environment, are either rewarded or punished. With deep reinforcement learning, vast amounts of data and increased computing power are required.

Reinforcement learning example: self-driving cars

Basically, the car has 3 sensors.
Camera: blind spot, rear-view mirror

IoT connectivity through cloud data

Supervised Learning
Logistic regression:

Overview of Classification

Classification is a supervised machine learning process of categorizing a given set of input data into classes based on one or more variables. A classification problem can be posed on structured or unstructured data to predict which of a set of predetermined categories the data will fall into.

Classification in machine learning can involve two or more categories for a given data set. The model generates a probability score to assign the data to a specific category, such as spam or not spam, yes or no, disease or no disease, red or green.

Some Applications of Machine Learning Classification Problems

● Image classification

● Fraud detection

● Document classification

● Spam filtering

● Facial recognition

● Voice recognition

● Medical diagnostic test

● Customer behavior prediction

● Product categorization

Types: In tasks such as recognizing handwritten characters or identifying spam, classification requires training data with a large number of input and output examples. The most common types of classification problems are binary classification, multi-class classification, multi-label classification, and imbalanced classification, which are described below.
1) Binary Classification: Binary classification is a type of classification problem that has only two possible outcomes. For example, yes or no, true or false, spam or not spam, etc.
2) Multi-Class Classification

Multi-class is a type of classification problem with more than two outcomes and does not have the concept of normal and abnormal outcomes. Here each sample is assigned only one label. For example, classifying images, classifying species, and categorizing faces, among others. Some common multi-class algorithms are decision trees, gradient boosting, k-nearest neighbors, and random forest.

3) Multi-Label Classification

Multi-label is a type of classification problem that may have more than one class
label assigned to the data. Here the model will have multiple outcomes. For
example, a book or a movie can be categorized into multiple genres, or an image
can have multiple objects. Some common multi-label algorithms are multi-label
decision trees, multi-label gradient boosting, and multi-label random forests.

4) Imbalanced Classification

Most machine learning algorithms assume that classes are roughly equally represented in the data. When they are not, the data set is imbalanced. An imbalanced classification problem is a classification problem where the class distribution of the dataset is skewed or biased. Handling it calls for specialized techniques that change the composition of data samples. Some examples of imbalanced classification are spam filtering, disease screening, and fraud detection.

Real time example of classification problem:


Let us first take a look at the Regression model

How logistic regression can be applied by classifying data

Logistic Regression is a statistical model that uses a logistic function (the inverse of the logit) to model a binary dependent variable (target variable). Like all regression analyses, logistic regression is a predictive analysis. It classifies the outcome by calculating the probability of that event occurring.

It works well with large amounts of data. Prediction accuracy comes down with smaller data sets.

Why we call it Regression and not classification


It is important to clarify the distinction between regression and classification models. Regression models predict a continuous variable. They can also predict probabilities. A probability-predicting regression model can be used as part of a classifier by imposing a decision rule (e.g. if p > 0.5 then 1 else 0), which is exactly what Logistic Regression does.
Here comes the concept of odds and log of odds:
If the probability of an event occurring is P, then the probability that it will not occur is (1 − P).
Odds = P / (1 − P)
Taking the log of the odds gives us:
Log of Odds = log(P / (1 − P))
This is nothing but the “logit” function.

Fig 3: The logit function heads to infinity as p approaches 1 and towards negative infinity as p approaches 0.

Note: Probability ranges from 0 to 1. Odds range from 0 to ∞. Log odds range from −∞ to ∞.

That is why the log odds are used: to avoid modeling a variable with a restricted range such as probability.

Logit() and Sigmoid()
The logit function maps probabilities to the full range of real numbers required prior to modeling.

The inverse of the logit function is the sigmoid function. That is, if you have a probability p, sigmoid(logit(p)) = p. The sigmoid function maps arbitrary real values back to the range [0, 1]. The sigmoid function is also known as the logistic function.

Thus the result obtained from the sigmoid function (a value in [0, 1]) is then passed through a decision rule to divide the outcome into classes as required.
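The relationship between the two functions and the decision rule can be checked numerically. The sketch below assumes natural logarithms and a 0.5 threshold; the function names are illustrative.

```python
import math

def logit(p):
    """Log of the odds: maps a probability in (0, 1) to the whole real line."""
    return math.log(p / (1 - p))

def sigmoid(z):
    """Inverse of logit: maps any real number back to a probability."""
    return 1 / (1 + math.exp(-z))

def classify(p, threshold=0.5):
    """Decision rule: class 1 if the probability exceeds the threshold, else 0."""
    return 1 if p > threshold else 0

for p in (0.1, 0.5, 0.9):
    z = logit(p)
    print(p, round(z, 3), round(sigmoid(z), 3), classify(sigmoid(z)))
```

Each probability survives the round trip through logit and sigmoid, and only the value above 0.5 is assigned to class 1.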

We are building the model with the primary aim to reduce the error

We need to find out the best regression curve

How?

For the (1) spam case:

log(1/0) = log(1) − log(0) = 0 − (−∞) = +∞

Why log(0) is infinite: let log_b(0) = c, so that b^c = 0.
If b < 1, c must be extremely large: c → +∞. Compare (0.1)^1000 and (0.1)^100: the larger the exponent, the smaller the value.
If b > 1, c must be extremely small: c → −∞.

By using the extended real number system, we get that log(0) equals negative infinity, −∞ (for a base b > 1). Infinity does not exist in the set of real numbers.

(2) no spam case:

Now, proceed with the regression line:

Now, with sigmoid function,


Is this the best curve? This is where maximum likelihood comes into the picture.

So, the likelihood for the entire data set = multiply all of them together

Product = 0.99 * 0.99 * 0.97 …

Log (likelihood) = log (0.99 * 0.99 * 0.97 …)
= log (0.99) + log (0.99) + log (0.97) + …
= 0.82
Keep rotating the regression line. Whichever line yields the maximum log-likelihood value is the maximum likelihood estimate (MLE).
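The product-versus-sum identity and the "pick the maximum" step can be sketched as follows. The per-example probabilities for the two candidate curves are hypothetical, and natural logarithms are assumed.

```python
import math

def log_likelihood(probabilities):
    """Sum of the logs, which equals the log of the product."""
    return sum(math.log(p) for p in probabilities)

# Hypothetical probabilities that two candidate curves assign to the observed labels
candidates = {
    "curve A": [0.99, 0.99, 0.97],
    "curve B": [0.80, 0.70, 0.60],
}

for name, probs in candidates.items():
    print(name, round(math.prod(probs), 4), round(log_likelihood(probs), 4))

# The candidate with the largest log-likelihood is the maximum likelihood choice.
best = max(candidates, key=lambda name: log_likelihood(candidates[name]))
print("maximum likelihood choice:", best)
```

Working with the sum of logs avoids multiplying many small numbers together, and it ranks the candidates in the same order as the raw product.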

Box Plots

Decision Trees

Perceptron
There are differences between the biological neuron and the artificial neuron.

McCulloch and Pitts model of neurons and its limitations.

One question that is worth considering is how realistic this model of a neuron is. The answer is: not very. Real neurons are much more complicated.

Inputs: The inputs to a real neuron are not necessarily summed linearly: there may be non-linear summations.
Real neurons do not output a single output response, but a spike train, that is, a sequence of pulses, and it is this spike train that encodes information.

This means that neurons don't actually respond as threshold devices, but produce a graded output in a continuous way.
They do still have the transition between firing and not firing, but the threshold at which they fire changes over time.
The neurons are not updated sequentially according to a computer clock, but update themselves randomly (asynchronously).

Note that the weights wi can be positive or negative. This corresponds to excitatory and inhibitory connections that make neurons more likely to fire and less likely to fire, respectively.

So in order to make a neuron learn, how should we change the weights and thresholds of the neurons so that the network gets the right answer more often?

Weights define the relative importance of the inputs.

Bias is a constant input, with its own weight, that shifts the threshold at which the neuron fires.


This seems obvious at first: some of the weights will be too big if the neuron fired when it
shouldn’t have, and too small if it didn’t fire when it should. So we compute yk −tk (the difference
between the output yk, which is what the neuron did, and the target for that neuron, tk, which is
what the neuron should have done. This is a possible error function).
We gave each neuron a firing threshold θ that determined what value it needed before it should
fire.

However, before we get to that, the learning rule needs to be finished: we need to decide how much to change the weight by. This is done by multiplying the value above by a parameter called the learning rate, usually labeled as η. The value of the learning rate decides how fast the network learns. It's quite important, so it gets a little subsection of its own (next), but first let's write down the final rule for updating a weight wij:

wij ← wij − η(yj − tj) · xi (1)

Typically, 0.1 < η < 0.4.
If η = 1, learning is too fast: the weights change so quickly that the model becomes unstable and never settles down.
If η is close to 0, learning is too slow and training takes much longer.

The Perceptron Learning Algorithm


1. Initialization:
Small random weights are chosen to start with.
Weights refer to the relative importance of the inputs.
2. Training:
for T iterations or until all the outputs are correct:
  for each input vector:
    compute the activation of each neuron j using activation function g:
      yj = g(Σ_{i=0}^{m} wij xi) = 1 if Σ_{i=0}^{m} wij xi > 0; 0 if Σ_{i=0}^{m} wij xi ≤ 0
    update each of the weights individually using: wij ← wij − η(yj − tj) · xi
3. Recall:
compute the activation of each neuron j using:
  yj = g(Σ_{i=0}^{m} wij xi) = 1 if Σ_{i=0}^{m} wij xi > 0; 0 if Σ_{i=0}^{m} wij xi ≤ 0

Computing the computational complexity of this algorithm

Computing the computational complexity of this algorithm is very easy. The recall phase
In1 In2 target
0 0 0
0 1 1
1 0 1
1 1 1

Data for the OR logic function and a plot of the four data points

we’ll pick w0 = −0.05, w1 = −0.02, w2 = 0.02. Now we feed in the first input, where both inputs
are 0: (0, 0). Remember that the input to the bias weight is always −1, so the value that reaches
the neuron is −0.05 × −1 + −0.02 × 0 + 0.02 × 0 = 0.05. This value is above 0, so the neuron
fires and the output is 1, which is incorrect according to the target.

The update rule tells us that we need to apply Equation (1) to each of the weights separately
(we’ll pick a value of η = 0.25 for the example):
(Bias) w0 : −0.05 − 0.25 × (1 − 0) × −1 = 0.2,
(input1) w1 : −0.02 − 0.25 × (1 − 0) × 0 = −0.02,
(input2) w2 : 0.02 − 0.25 × (1 − 0) × 0 = 0.02.
Now we feed in the next input (0, 1) with the updated weights w0, w1, w2 and compute the output (check that you agree that the neuron does not fire, but that it should) and then apply the learning rule again:
w0 : 0.2 − 0.25 × (0 − 1) × −1 = −0.05,
w1 : −0.02 − 0.25 × (0 − 1) × 0 = −0.02,
w2 : 0.02 − 0.25 × (0 − 1) × 1 = 0.27.
For the (1, 0) input the answer is already correct, so we don't have to update the weights at all, and the same is true for the (1, 1) input.
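The hand calculation can be replayed in code. This sketch applies Equation (1) with the same starting weights, η = 0.25 and a bias input of −1, and keeps making passes over the OR data until a full pass has no mistakes. The function names and the cap of 20 passes are illustrative choices.

```python
def fires(weights, inputs):
    """Threshold activation: 1 if the weighted sum is above 0, else 0."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) > 0 else 0

def train_perceptron(data, weights, eta=0.25, max_epochs=20):
    """Apply w <- w - eta * (y - t) * x until one full pass makes no mistakes."""
    for epoch in range(1, max_epochs + 1):
        mistakes = 0
        for (in1, in2), target in data:
            inputs = (-1, in1, in2)  # the bias input is always -1
            y = fires(weights, inputs)
            if y != target:
                mistakes += 1
            # When y equals the target the correction term is zero.
            weights = [w - eta * (y - target) * x for w, x in zip(weights, inputs)]
        if mistakes == 0:
            return weights, epoch
    return weights, max_epochs

or_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights, epochs = train_perceptron(or_data, [-0.05, -0.02, 0.02])
print([round(w, 2) for w in weights], epochs)
```

After the first pass the weights match the hand-computed values (−0.05, −0.02, 0.27), but the (0, 0) input is then misclassified again, so further passes are needed before every output is correct.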

Decision Boundary/ Discriminant


The figure shows the decision boundary, which marks where the decision about which class to assign to the input changes from crosses to circles.

LINEAR SEPARABILITY

What does the Perceptron actually compute? For our one output neuron example of the OR
data it tries to separate out the cases where the neuron should fire from those where it
shouldn’t.

In fact, that is exactly what the Perceptron does: it tries to find a straight line (in 2D, a plane in
3D, and a hyperplane in higher dimensions) where the neuron fires on one side of the line, and
doesn’t on the other. This line is called the decision boundary or discriminant function,

In matrix notation, consider just one input vector x. The neuron fires if x · w^T ≥ 0 (where w is the row of W that connects the inputs to one particular neuron and x is the input vector).
Getting back to the Perceptron, the boundary case is where we find an input vector x1 that has x1 · w^T = 0. Now suppose that we find another input vector x2 that satisfies x2 · w^T = 0.
Putting these two equations together we get:

x1 · w^T = x2 · w^T

or (x1 − x2) · w^T = 0

Writing the dot product in terms of the angle theta between the two vectors:

||x1 − x2|| · ||w|| · cos(theta) = 0

cos(theta) = 0; theta = 90 or −90 degrees

Now x1 − x2 is a straight line between two points that lie on the decision boundary, and the weight vector w^T must be perpendicular to that.
Linear classifier and margins

A linear classifier is defined by a hyperplane's normal vector w and an offset b, i.e. the decision boundary is {x | w⊤x + b = 0} (thick line). Each of the two halfspaces defined by this hyperplane corresponds to one class, i.e. f(x) = sign(w⊤x + b).
The margin of a linear classifier is the minimal distance of any training point to the hyperplane. In this case it is the distance between the dotted lines and the thick line.

Single Perceptron: A linear function

Multi Layer Perceptron: A nonlinear function; multiple hidden layers

Single-layer Perceptrons can learn only linearly separable patterns. A Multilayer Perceptron, or feedforward neural network with two or more layers, has greater processing power and can process non-linear patterns as well. A single Perceptron can implement linearly separable logic gates such as AND and OR; XOR is not linearly separable and needs a Multilayer Perceptron.

Activation functions: convert linear models into non-linear models

How do we draw a linear decision boundary for this setup that separates these 2 classes?
By going to 3D, the model can capture a non-linear pattern.

Introduce a new dimension; now the data is in 3D, which makes it easy to separate the classes linearly.
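One way to see the effect of an extra dimension is the XOR data, which no straight line separates in 2D. The data set in the original figure is not recoverable, so XOR is used here as an assumed stand-in. The third coordinate (the product of the two inputs) and the plane's weights are hand-picked for illustration.

```python
# XOR: not linearly separable in the original two dimensions.
xor_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def lift(x1, x2):
    """Add a third coordinate, the product x1 * x2 (a hypothetical choice)."""
    return (x1, x2, x1 * x2)

def linear_classifier(point, weights=(1.0, 1.0, -2.0), bias=-0.5):
    """Fire when w . x + b > 0; weights and bias are hand-picked for illustration."""
    return 1 if sum(w * x for w, x in zip(weights, point)) + bias > 0 else 0

predictions = [linear_classifier(lift(x1, x2)) for (x1, x2), _ in xor_data]
targets = [t for _, t in xor_data]
print(predictions, targets)
```

In the lifted 3D space a single plane classifies all four points correctly, even though the rule is non-linear in terms of the original two inputs.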

Perceptron Training

The error function that we used for the Perceptron was

This is less sensitive; also, if you do not take the absolute values of the errors, positive and negative errors cancel each other out, which causes a problem.
The sum-of-squares error function is the right candidate.

If we differentiate a function, it tells us the gradient of that function, which is the direction along which it increases and decreases the most.
So if we differentiate an error function, we get the gradient of the error.
Since the purpose of learning is to minimize the error, following the error function downhill (in other words, in the direction of the negative gradient) will give us what we want.

Local minima at 1 and 2. The weights of the network are trained so that the error goes downhill until it reaches a local minimum, just like a ball rolling under gravity.
Gravity will make the ball roll downhill (follow the downhill gradient) until it ends up in the bottom of one of the hollows. These are places where the error is small, so that is exactly what we want. This is why the algorithm is called gradient descent.

For the gradient descent algorithm to reach the local minimum we must set the learning rate to an appropriate value, which is neither too low nor too high. This is important because if the steps it takes are too big, it may not reach the local minimum because it bounces back and forth across the convex error function (see left image below). If we set the learning rate to a very small value, gradient descent will eventually reach the local minimum but that may take a while (see the right image).
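The effect of the learning rate can be seen on a toy error function. This sketch assumes E(w) = w², whose gradient is 2w and whose minimum is at w = 0. The three learning rates, the starting point and the step count are hypothetical.

```python
def gradient_descent(learning_rate, start=1.0, steps=20):
    """Minimise E(w) = w**2 by repeatedly stepping against the gradient 2 * w."""
    w = start
    for _ in range(steps):
        w -= learning_rate * 2 * w
    return w

too_small = gradient_descent(0.001)   # barely moves in 20 steps
reasonable = gradient_descent(0.1)    # approaches the minimum at 0
too_large = gradient_descent(1.1)     # overshoots further on every step

print(round(too_small, 4), round(reasonable, 4), round(too_large, 4))
```

With the large rate each step jumps across the minimum to a point further away than before, so the error grows instead of shrinking.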

So what should we differentiate with respect to? There are only three things in the network that
change: the inputs, the activation function that decides whether or not the node fires, and the
weights. The first and second are out of our control when the algorithm is running, so only the
weights matter, and therefore they are what we differentiate with respect to

1.The threshold function that we used for the Perceptron. Note the discontinuity where the value
changes from 0 to 1.
2. The sigmoid function, which looks qualitatively fairly similar, but varies smoothly and
differentiably.

Error in Prediction:

Pass 1:

Pass 2:

Pass 3:

Probability
A measure of uncertainty
Uncertainty refers to the lack of complete knowledge about the world
or a specific situation. For example, if we're unsure whether a
condition A is true or false, we can't definitively conclude the outcome
of another condition B based on A. This uncertainty arises from factors
like incomplete information or ambiguity. To address such uncertainty,
we employ techniques like uncertain reasoning or probabilistic
reasoning, which allow us to make informed decisions even when
dealing with incomplete or uncertain knowledge.

Joint probability vs Conditional probability vs Marginal probability

Joint probability is the probability of two events occurring simultaneously.
Marginal probability is the probability of an event irrespective of the outcome of another variable.
Conditional probability is the probability of one event occurring given that a second event has occurred.

● Independent: Each event is not affected by other events. Example: Tossing a coin two times. The outcome of the first toss will not affect the outcome of the second toss.
● Dependent: (also called Conditional) An event is affected by other events. Example: Drawing 2 cards from a deck. After taking one card from the deck there are fewer cards available, so the probabilities change!
● Mutually Exclusive: Two events can't happen at the same time. Example: We cannot play football and rugby at the same time on the same football ground.

‌Joint Probability

Joint probability is the likelihood of more than one event occurring at the same time, P(A and B): the probability of event A and event B occurring together. It is the probability of the intersection of two or more events, written as P(A ∩ B).

Conditions for Joint Probability

● One is that events X and Y must happen at the same time. Example: Throwing two dice simultaneously.
● The other is that events X and Y must be independent of each other. That means the outcome of event X does not influence the outcome of event Y. Example: Rolling two dice.
● If these conditions are met, then P(X ∩ Y) = P(X) * P(Y).

Conditional Probability

The conditional probability of an event B is the probability that the event will occur given the knowledge that an event A has already occurred. It is denoted by P(B|A).

So now, the joint probability of two dependent events becomes

P(A and B) = P(A)P(B|A)

Types of Probabilistic Reasoning


The main reason for probabilistic reasoning to come into existence is uncertainty.
Uncertainty can arise from defects in various factors, such as experimental errors, faulty tools, insufficient data, and many more.
To resolve the issue of uncertainty, there are two ways.

1. Bayes' rule

2. Bayesian Statistics

Bayes' Rule
Bayes' rule is one of the basic rules of probability.
(Thomas Bayes, an 18th-century mathematician)
It lets a model update its prior knowledge in light of new evidence. We use Bayes' rule in many fields nowadays, such as prediction, classification, and decision-making tasks where uncertainty needs to be handled.
The mathematical formula of Bayes' rule is as follows.

P(A | B) = P(B | A) * P(A) / P(B)

where the probability of A is determined given that event B is already known to have occurred. A and B are dependent events.
This follows from the definition of conditional probability:
P(A | B) = P(A ⋂ B) / P(B), where P(B) ≠ 0

If A and B are independent events:

P(A | B) = P(A)

Ex 2:
The probability of a student passing in science is ⅘ and the probability of the student passing in both science and mathematics is ½. What is the probability of that student passing in mathematics knowing that he passed in science?

Solution:

Let A ≡ event of passing in science

B ≡ event of passing in maths

Given, P(A) = ⅘ and P(A ∩ B) = ½

Then, probability of passing maths after passing in science

= P(B|A) = P(A ∩ B) / P(A)

= ½ ÷ ⅘ = ⅝
∴ the probability of passing in maths, given a pass in science, is ⅝.
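The same division can be done with exact fractions, which avoids rounding. The function and variable names below are illustrative.

```python
from fractions import Fraction

def conditional(p_a_and_b, p_a):
    """P(B | A) = P(A and B) / P(A), defined only when P(A) > 0."""
    if p_a == 0:
        raise ValueError("P(A) must be greater than zero")
    return p_a_and_b / p_a

p_science = Fraction(4, 5)   # P(A): passes science
p_both = Fraction(1, 2)      # P(A and B): passes science and maths

p_maths_given_science = conditional(p_both, p_science)
print(p_maths_given_science)  # prints 5/8
```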

Joint probability: when both of the events are occurring at the same
time, then the probability of A and B is:
P(A and B) = P(A ⋂ B)

Examples of Conditional Probability:


Conditional probability is a method to predict the outcome of an experiment, based on partial information.

Question: You throw a 6-sided die. You have partial information about the roll: it is an even number. How likely is it to be a 6?

We know that there are 6 possible outcomes from a roll (1, 2, 3, 4, 5, 6).

After the information is known that the throw is even-numbered, we are left with 3 possible outcomes: 2, 4, 6.

Thus, out of three possible scenarios, the probability of throwing a 6 given that the roll is an even number is 1/3.
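The counting argument can be written out by listing the outcomes, assuming a fair die so that every face is equally likely. The variable names are illustrative.

```python
from fractions import Fraction

outcomes = [1, 2, 3, 4, 5, 6]

# Condition on the partial information: keep only the even rolls.
even = [o for o in outcomes if o % 2 == 0]

# Within that reduced set, count the rolls that are a 6.
six_and_even = [o for o in even if o == 6]

p_six_given_even = Fraction(len(six_and_even), len(even))
p_six = Fraction(outcomes.count(6), len(outcomes))
print(p_six, p_six_given_even)  # prints 1/6 1/3
```

Knowing the roll is even shrinks the set of possible outcomes from six to three, which raises the chance of a 6 from 1/6 to 1/3.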

If all outcomes are equally likely, as is the case for the dice roll above, we can say that

The probability of occurrence of any event A when another event B in relation to A has
already occurred is known as conditional probability.
It is denoted by P(A|B).
Sample space is given by Universe U, and there are two events A and B. In a situation
where event B has already occurred, then our sample space U naturally gets reduced to
B because now the chances of occurrence of an event will lie inside B.
Question:
Event A represents eating cake for breakfast; event B represents eating pizza for
lunch.
On a randomly selected day, the probability of eating cake for breakfast is 0.6 and
the probability of eating pizza for lunch is 0.5. The conditional probability of having
cake for breakfast, given pizza for lunch, is 0.7.
Based on this information, find the conditional probability of having pizza for lunch,
given that cake was eaten for breakfast.

P(A) = 0.6 P(B) = 0.5

P(A | B) = 0.7 P( B | A) = ?

Since P(A | B) ≠ P(A), A and B are dependent events.

Find P(B | A).

P(B | A) = P(A ⋂ B) / P(A)

P(A | B) = P(A ⋂ B) / P(B), so P(A ⋂ B) = P(A | B) * P(B)

P(B | A) = P(A | B) * P(B) / P(A)

= (0.7 * 0.5) / 0.6 ≈ 0.58
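
As a quick check of the arithmetic, the cake and pizza numbers can be plugged into the same formula. This is only a sketch; the function name `bayes` is illustrative.

```python
def bayes(p_a_given_b, p_b, p_a):
    """Return P(B|A) = P(A|B) * P(B) / P(A)."""
    if p_a == 0:
        raise ValueError("P(A) must be non-zero")
    return p_a_given_b * p_b / p_a


# A = cake for breakfast, B = pizza for lunch (values from the question)
p_b_given_a = bayes(p_a_given_b=0.7, p_b=0.5, p_a=0.6)
print(round(p_b_given_a, 2))  # 0.58
```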

An email password has 8 lower-case letters. If all password combinations
are possible, what is the chance that the password is ‘password’?

(c) How likely is it that a person is HIV-positive given that the
person tested negative for HIV?

(d) You built a machine learning model to classify pictures of
chihuahuas and muffins. Your classification model informs you that
picture X is a photo of a chihuahua (a small pet dog). How likely is it
for the model to be wrong?

Let’s compare two scenarios. By default, there is no limit on the
length of the password.



Scenario A) Without any additional information, what is the
chance that the email password is ‘1234’?

In this scenario, the probability that the password is 1234 is close to
zero. This is because there are infinitely many possible passwords, and
1234 is just one of them.

Scenario B) Given that the email password contains only the
digits 1, 2, 3, 4, what is the chance that the password is
‘1234’?

In scenario B, we can imagine being transported into a universe where all
email passwords must have 4 characters and contain each of 1, 2, 3 and 4
exactly once. Now, in this world, we ask ourselves: what is the probability
that these 4 digits are arranged in the sequence 1234?

In this world, the password can only be one of the 24 permutations of 1234:

1234, 1243, 1324, 1342, 1423, 1432, 2134, … , 4321

Thus, the conditional probability is 1/24.
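
The count of 24 can be confirmed by enumerating every arrangement of the four digits. This is a small sketch using the standard library, assuming each digit is used exactly once; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import permutations

# Every 4-character password that uses each of the digits 1, 2, 3, 4 once
passwords = ["".join(p) for p in permutations("1234")]

p_1234 = Fraction(passwords.count("1234"), len(passwords))
print(len(passwords), p_1234)  # 24 1/24
```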

Conditional Probability in Machine Learning and Confusion Matrix

Conditional probability is useful in the context of evaluating machine

learning models.

Chihuahua vs Muffin: Images from ImageNet

The following results are obtained.



Now, you feed a new picture to your machine learning model. Given

your classification model predicts that picture X is a photo of a

chihuahua, how likely is it that the photo is actually a photo of a

chihuahua? Mathematically, this translates to the following —

Find,
P( image actually is a chihuahua | image predicted as a chihuahua)

= 3/4

Steps:

1. Since it was given that the model made a prediction of

‘chihuahua’, we will only focus on the yellow highlighted

column and ignore the other column. In other words, we

transport ourselves to a ‘new universe’ where the model only

can predict that the picture is a chihuahua.

2. In this new universe, the model made 600 predictions in

total.

3. Out of the 600 predictions made, 450 of those are correct,

while 150 are incorrect.

4. Thus, the conditional probability is 450/600, which

simplifies to 3/4.
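
The steps above can be sketched as a restriction to the 'predicted chihuahua' column. Only the two counts quoted in the steps are used; the variable names are illustrative.

```python
from fractions import Fraction

correct_chihuahua = 450  # predicted chihuahua and actually a chihuahua
wrong_chihuahua = 150    # predicted chihuahua but actually not a chihuahua

# The 'new universe': every picture the model labelled as a chihuahua
predicted_chihuahua = correct_chihuahua + wrong_chihuahua

p_actual_given_predicted = Fraction(correct_chihuahua, predicted_chihuahua)
print(predicted_chihuahua, p_actual_given_predicted)  # 600 3/4
```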

The table is a confusion matrix, commonly used in machine
learning. A confusion matrix is a table that tells us how well our model
has performed after it has been trained.

For a two-class problem, a confusion matrix is a table with two rows and
two columns that reports the number of false positives, false negatives,
true positives, and true negatives.

An understanding of the confusion matrix and its associated metrics
requires an understanding of conditional probability.

In fact, what we have just calculated is a metric called precision.
Precision, and some other metrics, are formalized as follows.



A confusion Matrix

Precision =

P(X is actually positive | X is predicted as positive)

Recall =

P(X is predicted as positive | X is actually positive)

Specificity =

P(X is predicted as negative | X is actually negative)
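
In terms of confusion-matrix counts, these three conditional probabilities reduce to simple ratios. The sketch below assumes 'chihuahua' is the positive class; the true-positive and false-positive counts (450 and 150) come from the example, while the false-negative and true-negative counts (50 and 350) are hypothetical values chosen only for illustration. The function name `confusion_metrics` is also illustrative.

```python
from fractions import Fraction


def confusion_metrics(tp, fp, fn, tn):
    """Return precision, recall and specificity from confusion-matrix counts."""
    return {
        "precision": Fraction(tp, tp + fp),    # P(actually + | predicted +)
        "recall": Fraction(tp, tp + fn),       # P(predicted + | actually +)
        "specificity": Fraction(tn, tn + fp),  # P(predicted - | actually -)
    }


# tp and fp are from the example; fn and tn are hypothetical placeholders
metrics = confusion_metrics(tp=450, fp=150, fn=50, tn=350)
for name, value in metrics.items():
    print(name, value)
```

Each metric conditions on a different row or column total, which is why a model can score well on one and poorly on another.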

Bayes theorem:
Bayes theorem is used to determine the conditional probability of
event A when event B has already happened.
* (Thomas Bayes, an 18th-century mathematician)

Bayes Theorem can be derived for events and random variables separately using the
definition of conditional probability

Bayesian classification is a probabilistic approach to learning and
inference based on a different view of what it means to learn from
data, in which probability is used to represent uncertainty about the
relationship being learnt.

From the definition of conditional probability, Bayes theorem can be derived for events
as given below:

P(A|B) = P(A ⋂ B)/ P(B), where P(B) ≠ 0

P(B|A) = P(B ⋂ A)/ P(A), where P(A) ≠ 0

(Where, P (A) , P(B) are the probabilities of events, A and B)

Here, P(A ⋂ B) is the probability of both events A and B being true. Since
intersection is commutative (A ∩ B = B ∩ A),

P(B ⋂ A) = P(A ⋂ B)

=> P(A | B) P(B) = P(B | A) P(A)

P(A|B) = [P(B|A) P(A)] / P(B), where P(B) ≠ 0



Naive Bayes is a supervised classification algorithm based on a
simplified application of Bayes theorem.

For BoW (bag of words):

For code listings and datasets, refer to Kaggle:

https://www.kaggle.com/code/ankumagawa/sentimental-analysis-using-naive-bayes-classifier/notebook

MLE (maximum likelihood estimation) in Bayes theorem:

Scenario 1: Training data is 6 heads (H) and 4 tails (T)

Scenario 2: Training data is 4 heads (H) and 0 tails (T)

Using the MLE equations as above, the estimated probabilities are
p(H) = 4/4 = 1;
p(T) = 1 - p(H) = 0
There is a problem here. What is it?
Realistically, tails is a possible outcome, so its probability should
not be estimated as exactly 0. The estimate is heavily skewed by the small sample.
In a Naive Bayes model the class score is a product of per-feature
probabilities, so a single probability of 0 makes the whole product 0 and
renders all the other features irrelevant.
This ruins our model.
So, let us apply Laplace smoothing.
Applying Laplace smoothing (strength = 1), Scenario 2 can be reformulated as:
4 + 1 (H) and 0 + 1 (T)
That is, we pretend to have seen 1 more sample in each category.
Now, P(H) = 5/6 based on MLE

Applying Laplace smoothing (strength = 2), Scenario 2 can be
reformulated as:
4 + 2 (H) and 0 + 2 (T)
Now, P(H) = 6/8 based on MLE

Applying Laplace smoothing (strength → very large), Scenario 2
can be reformulated as:
4 + 100000 (H) and 0 + 100000 (T)
P(H) → 1/2

Laplace smoothing is put in place to make sure that no outcome is
assigned a probability of 0 and that the estimated distribution
looks more uniform.
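
The smoothing cases above can be reproduced with one small function. This is a sketch for the two-outcome coin example only: a strength of 0 corresponds to plain MLE, and the function name `smoothed_p_heads` is illustrative.

```python
from fractions import Fraction


def smoothed_p_heads(heads, tails, strength=0):
    """Estimate P(H) after adding `strength` pseudo-samples to each outcome.

    With two outcomes (H and T), the total count grows by 2 * strength.
    """
    return Fraction(heads + strength, heads + tails + 2 * strength)


print(smoothed_p_heads(6, 4))     # Scenario 1, plain MLE: 3/5
print(smoothed_p_heads(4, 0))     # Scenario 2, plain MLE: 1, so P(T) = 0
print(smoothed_p_heads(4, 0, 1))  # strength 1: 5/6
print(smoothed_p_heads(4, 0, 2))  # strength 2: 3/4 (that is, 6/8)
print(float(smoothed_p_heads(4, 0, 100000)))  # very large strength: close to 0.5
```

As the strength grows, the pseudo-samples outweigh the observed data and the estimate moves toward the uniform value of 1/2.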
