ML - Unit 1
Machine Learning
GenAI is all about generating fresh content and information. It learns from existing data, identifies patterns and structures, and then uses this knowledge to generate new content.
Applications:
Machine learning is used in a wide range of industries. Banking: identifying fraud and processing loans. Health care: assessing the possibility of a person running into health issues. Smart devices: quickly processing human conversations through natural language processing.
1. SUPERVISED LEARNING
Data must be divided into features (the input data) and labels (the output
data).
Classification
Regression
Regression algorithms are frequently used tools for forecasting trends. These
algorithms identify relationships between outcomes and other independent
variables to make accurate predictions. Linear regression algorithms are the
most widely used, but other commonly used regression algorithms include
logistic regression.
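As an illustrative sketch (the data values are assumptions, not from the notes), a linear regression can be fit with plain NumPy by solving the least-squares problem:

```python
import numpy as np

# Toy data lying on the line y = 2x + 1 (illustrative values only)
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * X + 1.0

# Prepend a column of ones for the intercept, then solve y = A w
A = np.column_stack([np.ones_like(X), X])
w, *_ = np.linalg.lstsq(A, y, rcond=None)

print(w)  # [intercept, slope] -> [1. 2.]
```

The fitted weight vector recovers the intercept and slope that generated the data.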
2. UNSUPERVISED LEARNING
No class labels.
Input data = X (independent variables)
Output data = Y (dependent variable)
Y = f(X)
We determine how the dependent variable changes with the independent variables.
Even though there are no class labels, we are able to divide the data into clusters consisting of similar items.
Clustering
These algorithms focus on similarities within raw data, and then group that information accordingly. More simply, these algorithms provide structure to raw data. Clustering algorithms are often used with marketing data to garner customer (or potential customer) insights, as well as for fraud detection. A common example is k-means clustering.
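As a minimal sketch of k-means (Lloyd's algorithm) on toy 1-D data, where the values and the initialization are illustrative assumptions:

```python
import numpy as np

# Two obvious groups: values near 1 and values near 9 (illustrative data).
data = np.array([1.0, 1.2, 0.8, 9.0, 9.3, 8.7]).reshape(-1, 1)

k = 2
centers = data[[0, 3]].copy()            # deterministic initialization
for _ in range(10):
    dists = np.abs(data - centers.T)     # distance of each point to each center
    labels = dists.argmin(axis=1)        # assign points to nearest center
    centers = np.array([data[labels == j].mean(axis=0) for j in range(k)])

print(centers.ravel())  # one center near 1.0, the other near 9.0
```

Each iteration alternates the two k-means steps: assign points to their nearest center, then move each center to the mean of its assigned points.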
Dimensionality Reduction
This helps simplify the model and improve the accuracy of outputs.
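As a hedged sketch (the data is an assumption, not from the notes), PCA via the singular value decomposition can reduce 2-D points that lie on a line down to one dimension:

```python
import numpy as np

# Points lying exactly on the line y = 2x, so one component carries all
# the variance (illustrative data).
X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]])

Xc = X - X.mean(axis=0)                  # center the data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S**2 / np.sum(S**2)          # fraction of variance per component

Z = Xc @ Vt[0]                           # 1-D projection onto first component
print(explained)                         # first component dominates
```

Keeping only the components with large explained variance discards dimensions that carry little information.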
3. SEMI-SUPERVISED LEARNING
Label Propagation
4. REINFORCEMENT LEARNING
Reinforcement learning trains an agent, by trial and error, to take actions that achieve a desired outcome. (Think simulations, computer games and the real world.)
These are some of the algorithms that fall under reinforcement learning.
Q-Learning
Used in the development of self-driving cars, video games and robots, deep reinforcement learning combines deep learning (machine learning based on artificial neural networks) with reinforcement learning.
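As an illustrative sketch of tabular Q-learning (the corridor environment and all parameter values are assumptions, not from the notes):

```python
import numpy as np

# A 5-state corridor: the agent starts at state 0 and earns reward 1 for
# stepping onto state 4. Actions: 0 = left, 1 = right.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.5, 0.9, 0.2    # illustrative hyperparameters
rng = np.random.default_rng(0)

for _ in range(500):                     # episodes
    s = 0
    while s != 4:
        # epsilon-greedy action choice
        a = int(rng.integers(2)) if rng.random() < epsilon else int(Q[s].argmax())
        s2 = max(0, s - 1) if a == 0 else s + 1
        r = 1.0 if s2 == 4 else 0.0
        # Q-learning update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
        s = s2

print(Q.argmax(axis=1)[:4])  # greedy policy: move right in states 0..3
```

After training, the greedy policy read off the Q-table moves right in every non-terminal state, which is optimal here.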
Supervised Learning
Logistic regression:
Overview of Classification
Classification in machine learning can involve two or more categories of a given data set. The model generates a probability score to assign the data to a specific category, such as spam or not spam, yes or no, disease or no disease, red or green.
● Image classification
● Fraud detection
● Document classification
● Spam filtering
● Facial recognition
● Voice recognition
● Product categorization
2) Multi-Class Classification
Multi-class is a type of classification problem with more than two outcomes and does not have the concept of normal and abnormal outcomes. Here each outcome is assigned to only one label. For example, classifying images, classifying species, and categorizing faces, among others. Some common multi-class algorithms are decision trees, gradient boosting, k-nearest neighbors, and random forests.
3) Multi-Label Classification
Multi-label is a type of classification problem that may have more than one class
label assigned to the data. Here the model will have multiple outcomes. For
example, a book or a movie can be categorized into multiple genres, or an image
can have multiple objects. Some common multi-label algorithms are multi-label
decision trees, multi-label gradient boosting, and multi-label random forests.
4) Imbalanced Classification
Most machine learning algorithms assume equal data distribution. When the
data distribution is not equal, it leads to imbalance. An imbalanced classification
problem is a classification problem where the distribution of the dataset is
skewed or biased. This method employs specialized techniques to change the
composition of data samples. Some examples of imbalanced classification are
spam filtering, disease screening, and fraud detection.
Fig 3: The logit function heads to infinity as p approaches 1 and towards negative infinity as it approaches 0.
That is why the log odds are used to avoid modeling a variable with a
restricted range such as probability.
Logit() and Sigmoid()
The logit function, logit(p) = log(p / (1 − p)), maps probabilities to the full range of real numbers. The inverse of the logit function is the sigmoid function, sigmoid(x) = 1 / (1 + e^(−x)), which maps arbitrary real values back to the range [0, 1].
We are building the model with the primary aim of reducing the error.
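As a small numerical check (illustrative, not from the notes) that the sigmoid is the inverse of the logit:

```python
import math

def logit(p):
    # log-odds: maps (0, 1) to the full real line
    return math.log(p / (1.0 - p))

def sigmoid(x):
    # maps any real value back into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

for p in (0.1, 0.5, 0.9):
    assert abs(sigmoid(logit(p)) - p) < 1e-12
print("sigmoid(logit(p)) == p for p in (0, 1)")
```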
How?
log(1/0) = log(1) − log(0) = 0 − (−∞) = +∞

Why is log(0) infinite? Write log_b(0) = c, i.e. b^c = 0. If b < 1, c must be extremely large: compare (0.1)^1000 with (0.1)^100; the larger the exponent, the smaller the value. So c → +∞ (and for b > 1, by the same argument, c → −∞, which is the case used above).
Is this the best curve? To decide, maximum likelihood estimation comes into the picture.
Box Plots
Decision Trees
Perceptron
There are differences between the biological neuron and the artificial neuron:
● Inputs: the inputs to a real neuron are not necessarily summed linearly; there may be non-linear summations.
● Real neurons do not output a single output response, but a spike train, that is, a sequence of pulses, and it is this spike train that encodes information. This means that neurons don't actually respond as threshold devices, but produce a graded output in a continuous way. They do still have the transition between firing and not firing, though, but the threshold at which they fire changes over time.
● The neurons are not updated sequentially according to a computer clock, but update themselves randomly (asynchronously).
However, before we get to that, the learning rule needs to be finished—we need
to decide how much to change the weight by. This is done by multiplying the
value above by a parameter called the learning rate, usually labeled as η. The
value of the learning rate decides how fast the network learns. It’s quite
important, so it gets a little subsection of its own (next), but first let’s write down
1. Initialization:
Start with small random weights. Weights refer to the relative importance of the inputs.
2. Training:
For T iterations, or until all the outputs are correct:
  for each input vector:
    compute the activation of each neuron j using activation function g
3. Recall:
Compute the activation of each neuron j using:
y_j = g(Σ_{i=0}^{m} w_ij x_i) = 1 if Σ_{i=0}^{m} w_ij x_i > 0, and 0 if it is ≤ 0
Computing the computational complexity of this algorithm is very easy: the recall phase loops over the neurons, and within that over the inputs, so it is O(mn) for m inputs and n neurons.
In1  In2  Target
 0    0     0
 0    1     1
 1    0     1
 1    1     1
Data for the OR logic function and a plot of the four data points.
we’ll pick w0 = −0.05, w1 = −0.02, w2 = 0.02. Now we feed in the first input, where both inputs
are 0: (0, 0). Remember that the input to the bias weight is always −1, so the value that reaches
the neuron is −0.05 × −1 + −0.02 × 0 + 0.02 × 0 = 0.05. This value is above 0, so the neuron
fires and the output is 1, which is incorrect according to the target.
The update rule tells us that we need to apply Equation (1) to each of the weights separately
(we’ll pick a value of η = 0.25 for the example):
(Bias) w0 : −0.05 − 0.25 × (1 − 0) × −1 = 0.2,
(input1) w1 : −0.02 − 0.25 × (1 − 0) × 0 = −0.02,
(input2) w2 : 0.02 − 0.25 × (1 − 0) × 0 = 0.02.
53
Now we feed w0,w1,w2 in the next input (0, 1) and compute the output (check that you agree
that the neuron does not fire, but that it should) and then apply the learning rule again:
w0 : 0.2 − 0.25 × (0 − 1) × −1 = −0.05,
w1 : −0.02 − 0.25 × (0 − 1) × 0 = −0.02,
w2 : 0.02 − 0.25 × (0 − 1) × 1 = 0.27.
For the (1, 0) input the answer is already correct, so we don't have to update the weights at all, and the same is true for the (1, 1) input.
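The worked example above can be sketched in code, starting from the same initial weights and learning rate (the epoch count is an assumption; OR converges in a few passes):

```python
import numpy as np

# OR data: inputs and targets, with the bias input fixed at -1 as in the text.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([0.0, 1.0, 1.0, 1.0])
X = np.column_stack([-np.ones(4), X])      # prepend the bias input (-1)

w = np.array([-0.05, -0.02, 0.02])         # same initial weights as the text
eta = 0.25                                 # same learning rate

for _ in range(10):                        # epochs
    for i in range(4):
        y = 1.0 if X[i] @ w > 0 else 0.0   # neuron fires if activation > 0
        w = w - eta * (y - t[i]) * X[i]    # the update rule from the text

print((X @ w > 0).astype(int))  # [0 1 1 1], matching the targets
```

Each weight is updated per input vector, exactly as in the hand-worked passes, and the final weights classify all four OR inputs correctly.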
LINEAR SEPARABILITY
What does the Perceptron actually compute? For our one output neuron example of the OR
data it tries to separate out the cases where the neuron should fire from those where it
shouldn’t.
In fact, that is exactly what the Perceptron does: it tries to find a straight line (in 2D, a plane in
3D, and a hyperplane in higher dimensions) where the neuron fires on one side of the line, and
doesn’t on the other. This line is called the decision boundary or discriminant function,
In matrix notation, consider just one input vector x. The neuron fires if x · w^T ≥ 0 (where w is the row of W that connects the inputs to one particular neuron, and x is the input vector).
Getting back to the Perceptron, the boundary case is where we find an input vector x1 that has x1 · w^T = 0. Now suppose that we find another input vector x2 that satisfies x2 · w^T = 0. Putting these two equations together we get:

x1 · w^T = x2 · w^T
⇒ (x1 − x2) · w^T = 0
⇒ ||x1 − x2|| · ||w^T|| · cos(θ) = 0

Now x1 − x2 is a straight line between two points that lie on the decision boundary, and the weight vector w^T must be perpendicular to that.
How do we draw a linear decision boundary for this set-up, which separates these two classes? In 2D the pattern is non-linear, so a straight line cannot separate them. Introduce a new dimension: in 3D the data becomes linearly separable.
Perceptron Training
Simply summing the errors is problematic: if you do not take the absolute values of the errors, positive and negative errors cancel out. The sum-of-squares error function avoids this and is differentiable, so it is the right candidate.
if we differentiate a function, then it tells us the gradient of that function, which is the direction
along which it increases and decreases the most.
So if we differentiate an error function, we get the gradient of the error.
Since the purpose of learning is to minimize the error, following the error function downhill (in
other words, in the direction of the negative gradient) will give us what we want.
Local minima at 1 and 2. The weights of the network are trained so that the error goes downhill until it reaches a local minimum, just like a ball rolling under gravity.
Gravity will make the ball roll downhill (follow the downhill gradient) until it ends up in the bottom
of one of the hollows. These are places where the error is small, so that is exactly what we
want. This is why the algorithm is called gradient descent.
For the gradient descent algorithm to reach the local minimum we must set the learning rate to an appropriate value, which is neither too low nor too high. This is important because if the steps it takes are too big, it may never reach the local minimum, bouncing back and forth across the valley of the convex function (see left image below). If we set the learning rate to a very small value, gradient descent will eventually reach the local minimum, but that may take a while (see the right image).
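The effect of the learning rate can be sketched on the simplest convex function, f(w) = w², whose gradient is 2w (the step counts and rates below are illustrative assumptions):

```python
# Gradient descent on f(w) = w^2, minimum at w = 0.
def descend(lr, steps=50, w=5.0):
    for _ in range(steps):
        w -= lr * 2.0 * w          # w <- w - eta * f'(w)
    return w

good = descend(lr=0.1)             # small enough: converges towards 0
bad = descend(lr=1.1)              # too large: overshoots and diverges
print(abs(good), abs(bad))
```

With lr = 0.1 each step shrinks w by a factor 0.8, so it converges; with lr = 1.1 each step multiplies w by −1.2, so it bounces across the valley with growing amplitude.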
So what should we differentiate with respect to? There are only three things in the network that
change: the inputs, the activation function that decides whether or not the node fires, and the
weights. The first and second are out of our control when the algorithm is running, so only the
weights matter, and therefore they are what we differentiate with respect to
1.The threshold function that we used for the Perceptron. Note the discontinuity where the value
changes from 0 to 1.
2. The sigmoid function, which looks qualitatively fairly similar, but varies smoothly and
differentiably.
Error in Prediction:
Pass 1:
Pass 2:
Pass 3:
Probability
A measure of uncertainty
Uncertainty refers to the lack of complete knowledge about the world
or a specific situation. For example, if we're unsure whether a
condition A is true or false, we can't definitively conclude the outcome
of another condition B based on A. This uncertainty arises from factors
like incomplete information or ambiguity. To address such uncertainty,
we employ techniques like uncertain reasoning or probabilistic
reasoning, which allow us to make informed decisions even when
dealing with incomplete or uncertain knowledge.
Joint Probability
Conditional Probability
1. Bayes' rule
2. Bayesian Statistics
Bayes' Rule
Bayes' rule is known as the basic rule of probability. (Thomas Bayes, an 18th-century mathematician.)
It lets a model update its prior beliefs in the light of new evidence. We use Bayes' rule in many fields nowadays, such as prediction, classification, and decision-making tasks where uncertainty needs to be handled.
The mathematical formula of Bayes' rule is as follows:
P(A | B) = P(B | A) · P(A) / P(B)
Ex 2:
The probability of a student passing in science is ⅘ and the probability of the student passing in both science and mathematics is ½. What is the probability that the student passes in mathematics, given that the student has passed in science?
Solution:
P(Maths | Science) = P(Maths ⋂ Science) / P(Science) = ½ ÷ ⅘ = ⅝
∴ the probability of passing in maths is ⅝.
Joint probability: when both of the events are occurring at the same
time, then the probability of A and B is:
P(A and B) = P(A ⋂ B)
Question: You throw a 6-sided die. You have partial information about the roll: it is an even number. How likely is it to be a 6?
If all outcomes are equally likely, as is the case for a fair die, the reduced sample space is {2, 4, 6}, so P(6 | even) = 1/3.
The probability of occurrence of any event A when another event B in relation to A has
already occurred is known as conditional probability.
It is denoted by P(A|B).
Sample space is given by Universe U, and there are two events A and B. In a situation
where event B has already occurred, then our sample space U naturally gets reduced to
B because now the chances of occurrence of an event will lie inside B.
Question:
Event A represents eating cake for breakfast; event B represents eating pizza for lunch.
On a randomly selected day, the probability of eating cake for breakfast is 0.6, and the probability of eating pizza for lunch is 0.5. The conditional probability of having cake for breakfast, given that pizza is had for lunch, is 0.7.
Based on this information, find the conditional probability of having pizza for lunch, given that cake is consumed for breakfast.
P(A | B) = 0.7, P(B | A) = ?
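The question above can be answered with a direct application of Bayes' rule:

```python
# Bayes' rule applied to the cake/pizza question.
p_A = 0.6            # P(cake for breakfast)
p_B = 0.5            # P(pizza for lunch)
p_A_given_B = 0.7    # P(cake | pizza)

# P(B | A) = P(A | B) * P(B) / P(A)
p_B_given_A = p_A_given_B * p_B / p_A
print(round(p_B_given_A, 4))  # 0.5833
```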
(c) How likely is it that a person is HIV-positive, given that the person tested negative for HIV?
Now, you feed a new picture to your machine learning model. Given the model's predictions and the actual labels, find:
P(image actually is a chihuahua | image predicted as a chihuahua) = 3/4
Steps: out of the images predicted as chihuahuas, count how many actually are chihuahuas; this ratio simplifies to 3/4.
A Confusion Matrix
A confusion matrix is a table (with two rows and two columns for binary classification) that tells us how well our model performs, by counting true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN).
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Specificity = TN / (TN + FP)
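As a sketch of computing these metrics from an assumed confusion matrix (the counts are illustrative, not from the notes):

```python
# Assumed counts: 3 true positives, 1 false positive,
# 2 false negatives, 4 true negatives.
tp, fp, fn, tn = 3, 1, 2, 4

precision = tp / (tp + fp)     # of predicted positives, how many are real
recall = tp / (tp + fn)        # of real positives, how many were found
specificity = tn / (tn + fp)   # of real negatives, how many were found

print(precision, recall, specificity)  # 0.75 0.6 0.8
```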
Bayes theorem:
Bayes theorem is used to determine the conditional probability of
event A when event B has already happened.
Bayes' theorem can be derived for events and random variables separately using the definition of conditional probability.
From the definition of conditional probability, Bayes' theorem can be derived for events as given below:
P(A | B) = P(A ⋂ B) / P(B) and P(B | A) = P(B ⋂ A) / P(A)
Here, P(A ⋂ B) is the probability of both events A and B being true, and P(B ⋂ A) = P(A ⋂ B), so equating the two expressions gives P(A | B) = P(B | A) · P(A) / P(B).
For BoW:
Use the equations of MLE as above, with the probabilities determined from the MLE conditions. For example, if every observed coin flip is heads:
p(H) = 1
p(T) = 1 − p(H) = 0
There is a problem we see now. What is the problem?
In a realistic sense, the probability of an event can never be equal to 0; a zero estimate means the distribution is skewed a lot.
If a probability is 0 in the Naive Bayes model, it makes all the features irrelevant, because the product of likelihoods becomes 0.
This ruins our model.
So, let us apply Laplace smoothing.
Now, apply Laplace smoothing (strength = 1).
Scenario 2 can be reformulated as:
4 + 1 (H) and 0 + 1 (T)
That means, basically, we pretend to see one more sample in each category.
Now, P(H) = 5/6 based on MLE.
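The smoothing step above can be sketched directly: add the strength to every count, and add strength × (number of categories) to the total.

```python
# Laplace (add-one) smoothing for the coin example:
# 4 observed heads, 0 observed tails, strength alpha = 1.
counts = {"H": 4, "T": 0}
alpha = 1
total = sum(counts.values()) + alpha * len(counts)

p = {k: (v + alpha) / total for k, v in counts.items()}
print(p)  # P(H) = 5/6, P(T) = 1/6 -- no category is left with probability 0
```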