ML Unit-1 Notes
LEARNING
The key concept that we will need to think about for our machines is learning from data, learning
from experience. Hopefully, we all agree that humans and other animals can display behaviours
that we label as intelligent by learning from experience. Learning is what gives us flexibility in
our life; the fact that we can adjust and adapt to new circumstances, and learn new tricks, no
matter how old a dog we are! The important parts of animal learning for this book are
remembering, adapting, and generalizing: recognizing that last time we were in this situation
(saw this data) we tried out some particular action (gave this output) and it worked (was correct).
The last word, generalizing, is about recognizing similarity between different situations, so that
things that applied in one place can be used in another. This is what makes learning useful,
because we can use our knowledge in lots of different places.
Of course, there are plenty of other bits to intelligence, such as reasoning, and logical deduction,
but we won’t worry too much about those. We are interested in the most fundamental parts of
intelligence—learning and adapting—and how we can model them in a computer. There has also
been a lot of interest in making computers reason and deduce facts. This was the basis of most
early Artificial Intelligence, and is sometimes known as symbolic processing because the
computer manipulates symbols that reflect the environment. In contrast, machine learning
methods are sometimes called sub-symbolic because no symbols or symbolic manipulation are
involved.
Machine Learning
Machine learning, then, is about making computers modify or adapt their actions (whether these
actions are making predictions, or controlling a robot) so that these actions get more accurate,
where accuracy is measured by how well the chosen actions reflect the correct ones. Imagine that
you are playing Scrabble (or some other game) against a computer. You might beat it every time
in the beginning, but after lots of games it starts beating you, until finally you never win. Either
you are getting worse, or the computer is learning how to win at Scrabble. Having learnt to beat
you, it can go on and use the same strategies against other players, so that it doesn’t start from
scratch with each new player; this is a form of generalization.
The inherent multi-disciplinarity of machine learning has long been recognized: it merges ideas
from neuroscience and biology, statistics, mathematics, and physics to make computers learn.
The computational complexity of the machine learning methods will also be of interest to us
since what we are producing is algorithms. It is particularly important because we might want to
use some of the methods on very large datasets, so algorithms that have high degree polynomial
complexity in the size of the dataset (or worse) will be a problem. The complexity is often
broken into two parts: the complexity of training, and the complexity of applying the trained
algorithm. Training does not happen very often, and is not usually time critical, so it can take
longer. However, we often want a decision about a test point quickly, and there are potentially
lots of test points when an algorithm is in use, so this needs to have low computational cost.
Definition of Machine Learning: “A computer program is said to learn from experience E with
respect to some class of tasks T and performance measure P, if its performance at tasks in T, as
measured by P, improves with experience E.” (Tom M. Mitchell)
Supervised learning A training set of examples with the correct responses (targets) is provided
and, based on this training set, the algorithm generalises to respond correctly to all possible
inputs. This is also called learning from exemplars.
Unsupervised learning Correct responses are not provided, but instead the algorithm tries to
identify similarities between the inputs so that inputs that have something in common are
categorized together. The statistical approach to unsupervised learning is known as density
estimation.
Reinforcement learning This is somewhere between supervised and unsupervised learning. The
algorithm gets told when the answer is wrong, but does not get told how to correct it. It has to
explore and try out different possibilities until it works out how to get the answer right.
Reinforcement learning is sometimes called learning with a critic because of this monitor that
scores the answer, but does not suggest improvements.
SUPERVISED LEARNING
Supervised learning, as the name indicates, has the presence of a supervisor as a teacher.
Basically, supervised learning is when we teach or train the machine using data that is well
labeled, which means some of the data is already tagged with the correct answer. After that, the
machine is provided with a new set of examples (data) so that the supervised learning algorithm
analyses the training data (the set of training examples) and produces a correct outcome from
labeled data.
For instance, suppose you are given a basket filled with different kinds of fruits. Now the first
step is to train the machine with all different fruits one by one like this:
If the shape of the object is rounded with a depression at the top, and it is red in color, then it
will be labeled as Apple.
If the shape of the object is a long curving cylinder with a green-yellow color, then it will
be labeled as Banana.
Now suppose that, after training, you are given a new fruit from the basket, say a banana, and
asked to identify it.
Since the machine has already learned from the previous data, it now has to use that knowledge:
it first classifies the fruit by its shape and color, confirms the fruit name as BANANA, and puts
it in the banana category. Thus the machine learns from the training data (the basket of fruits)
and then applies that knowledge to the test data (the new fruit).
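To make this concrete, here is a minimal sketch of the fruit example as a supervised classifier. The numeric encoding of shape and colour, the tiny training set, and the use of scikit-learn's DecisionTreeClassifier are illustrative assumptions, not part of the original example.

```python
# A minimal sketch of the fruit example as a supervised classifier.
# Hypothetical feature encoding: shape (0 = rounded with depression,
# 1 = long cylinder) and colour (0 = red, 1 = green-yellow).
from sklearn.tree import DecisionTreeClassifier

# Labeled training data: each row is (shape, colour), each target a fruit name.
X_train = [[0, 0], [0, 0], [1, 1], [1, 1]]
y_train = ["Apple", "Apple", "Banana", "Banana"]

clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)        # "training the machine" on labeled fruit

# A new, unlabeled fruit from the basket: long cylinder, green-yellow.
print(clf.predict([[1, 1]]))     # -> ['Banana']
```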
Supervised learning is classified into two categories of algorithms:
Classification: a classification problem is when the output variable is a category, such as
“apple” or “banana”.
Regression: a regression problem is when the output variable is a real value, such as “price”
or “weight”.
Unsupervised learning
Unsupervised learning is the training of a machine using information that is neither classified
nor labeled and allowing the algorithm to act on that information without guidance. Here the
task of the machine is to group unsorted information according to similarities, patterns, and
differences without any prior training of data.
Unlike supervised learning, no teacher is provided, which means no training will be given to the
machine. Therefore the machine is restricted to finding the hidden structure in unlabeled data by
itself.
For instance, suppose the machine is given an image containing both dogs and cats that it has
never seen before. The machine has no idea of the features of dogs and cats, so it cannot
categorize the image as ‘dogs and cats’. But it can categorize the animals according to their
similarities, patterns, and differences: it can easily split the picture into two parts, the first
containing all the images with dogs and the second containing all the images with cats. Here it
did not learn anything beforehand, which means there was no training data or examples.
It allows the model to work on its own to discover patterns and information that were
previously undetected. It mainly deals with unlabelled data.
Unsupervised learning is classified into two categories of algorithms:
Clustering: A clustering problem is where you want to discover the inherent groupings in
the data, such as grouping customers by purchasing behavior.
Association: An association rule learning problem is where you want to discover rules that
describe large portions of your data, such as people that buy X also tend to buy Y.
Types of Unsupervised Learning:
Clustering approaches:
1. Exclusive (partitioning)
2. Agglomerative
3. Overlapping
4. Probabilistic
Commonly used unsupervised algorithms:
1. Hierarchical clustering
2. K-means clustering (see the sketch below)
3. Principal Component Analysis
4. Singular Value Decomposition
5. Independent Component Analysis
(Strictly, the last three are dimensionality-reduction rather than clustering techniques, but they
are standard unsupervised methods.)
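As an illustration of clustering, the sketch below runs K-means on some made-up two-dimensional data with scikit-learn; the synthetic blobs and the choice of two clusters are assumptions for demonstration only.

```python
# A minimal clustering sketch: K-means groups unlabeled points by similarity.
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical unlabeled data: two loose blobs in 2-D.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)),    # blob around (0, 0)
               rng.normal(5, 0.5, (20, 2))])   # blob around (5, 5)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment for each point
print(kmeans.cluster_centers_)  # learned group centres
```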
Supervised vs. Unsupervised Machine Learning
Computational complexity: supervised learning generally uses simpler methods, whereas
unsupervised learning is computationally more complex.
The Brain and the Neuron
In animals, learning occurs within the brain. If we can understand how the brain works, then
there might be things in there for us to copy and use for our machine learning systems. While the
brain is an impressively powerful and complicated system, the basic building blocks that it is
made up of are fairly simple and easy to understand.
In computational terms the brain does exactly what we want. It deals with noisy and even
inconsistent data, and produces answers that are usually correct from very high dimensional data
(such as images) very quickly. All amazing for something that weighs about 1.5 kg and is losing
parts of itself all the time (neurons die as you age at impressive/depressing rates), but its
performance does not degrade appreciably (in the jargon, this means it is robust).
At the most basic level are the processing units of the brain: nerve cells called neurons. There
are lots of them (100 billion = 10^11 is the figure that is often given) and they
come in lots of different types, depending upon their particular task. However, their general
operation is similar in all cases: transmitter chemicals within the fluid of the brain raise or lower
the electrical potential inside the body of the neuron. If this membrane potential reaches some
threshold, the neuron spikes or fires, and a pulse of fixed strength and duration is sent down the
axon. The axons divide (arborise) into connections to many other neurons, connecting to each of
these neurons in a synapse. Each neuron is typically connected to thousands of other neurons, so
that it is estimated that there are about 100 trillion (= 10^14) synapses within the brain. After
firing, the neuron must wait for some time to recover its energy (the refractory period) before it
can fire again.
Each neuron can be viewed as a separate processor, performing a very simple computation:
deciding whether or not to fire. This makes the brain a massively parallel computer made up of
10^11 processing elements. If that is all there is to the brain, then we should be able to model it
inside a computer and end up with animal or human intelligence inside a computer. This is the
view of strong AI.
We do want to make programs that learn. So how does learning occur in the brain? The principal
concept is plasticity: modifying the strength of synaptic connections between neurons, and
creating new connections. We don’t know all of the mechanisms by which the strength of these
synapses gets adapted, but one method that does seem to be used was first postulated by Donald
Hebb in 1949.
Hebb’s Rule
Hebb’s rule says that the changes in the strength of synaptic connections are proportional to the
correlation in the firing of the two connecting neurons. So if two neurons consistently fire
simultaneously, then any connection between them will change in strength, becoming stronger.
However, if the two neurons never fire simultaneously, the connection between them will die
away. The idea is that if two neurons both respond to something, then they should be connected.
Let’s see a trivial example: suppose that you have a neuron somewhere that recognizes your
grandmother (this will probably get input from lots of visual processing neurons, but don’t worry
about that). Now if your grandmother always gives you a chocolate bar when she comes to visit,
then some neurons, which are happy because you like the taste of chocolate, will also be
stimulated. Since these neurons fire at the same time, they will be connected together, and the
connection will get stronger over time. So eventually, the sight of your grandmother, even in a
photo, will be enough to make you think of chocolate. Sound familiar? Pavlov used this idea,
called classical conditioning, to train his dogs so that when food was shown to the dogs and the
bell was rung at the same time, the neurons for salivating over the food and hearing the bell fired
simultaneously, and so became strongly connected. Over time, the strength of the synapse
between the neurons that responded to hearing the bell and those that caused the salivation reflex
was enough that just hearing the bell caused the salivation neurons to fire in sympathy. There are
other names for this idea that synaptic connections between neurons and assemblies of neurons
can be formed when they fire together and can become stronger. It is also known as long-term
potentiation and neural plasticity, and it does appear to have correlates in real brains.
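A toy sketch of Hebb's rule as a weight update, delta_w = eta * x * y: the learning rate eta and the activity patterns below are hypothetical, chosen only to show that a connection between co-active units strengthens while the others do not.

```python
# A minimal sketch of Hebb's rule: the weight change is proportional to the
# correlation between pre- and post-synaptic activity, delta_w = eta * x * y.
import numpy as np

eta = 0.1                         # learning rate (hypothetical value)
w = np.zeros(2)                   # weights from two input neurons

# Pre-synaptic activities x and post-synaptic activity y observed together.
for x, y in [(np.array([1.0, 0.0]), 1.0),   # input 0 fires with the output...
             (np.array([1.0, 0.0]), 1.0),
             (np.array([0.0, 1.0]), 0.0)]:  # ...input 1 never does
    w += eta * x * y              # neurons that fire together wire together

print(w)  # -> [0.2, 0.0]: the co-active connection strengthened
```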
FIGURE 3.1 A picture of McCulloch and Pitts’ mathematical model of a neuron. The inputs xi
are multiplied by the weights wi, and the neuron sums these values. If this sum is greater than
the threshold θ then the neuron fires; otherwise it does not.
A picture of their model is given in Figure 3.1, and we’ll use the picture to write down a
mathematical description. On the left of the picture are a set of input nodes (labeled x1, x2, . . .
xm). These are given some values, and as an example we’ll assume that there are three inputs,
with x1 = 1, x2 = 0, x3 = 0.5. In real neurons those inputs come from the outputs of other
neurons. So the 0 means that a neuron didn’t fire, the 1 means it did, and the 0.5 has no
biological meaning, but never mind.
Each of these other neuronal firings flowed along a synapse to arrive at our neuron, and those
synapses have strengths, called weights. The strength of the synapse affects the strength of the
signal, so we multiply the input by the weight of the synapse (so we get x1 × w1 and x2 × w2,
etc.). Now when all of these signals arrive into our neuron, it adds them up to see if there is
enough strength to make it fire.
We’ll write that as
h = w1x1 + w2x2 + … + wmxm = Σ wixi,
which just means sum (add up) all the inputs multiplied by their synaptic weights. I’ve assumed
that there are m of them, where m = 3 in the example. If the synaptic weights are w1 = 1, w2 =
−0.5, w3 = −1, then the input to our model neuron is h = 1 × 1 + 0 × (−0.5) + 0.5 × (−1) = 1 + 0
− 0.5 = 0.5. Now the neuron needs to decide if it is going to fire. For a real neuron, this is a
question of whether the membrane potential is above some threshold. We’ll pick a threshold
value (labeled θ), say θ = 0 as an example. Now, does our neuron fire? Well, h = 0.5 in the
example, and 0.5 > 0, so the neuron does fire, and produces output 1. If the neuron did not fire, it
would produce output 0.
The McCulloch and Pitts neuron is a binary threshold device. It sums up the inputs (multiplied
by the synaptic strengths or weights) and either fires (produces output 1) or does not fire
(produces output 0) depending on whether the input is above some threshold. We can write the
second half of the work of the neuron, the decision about whether or not to fire (which is known
as an activation function), as:
o = g(h) = 1 if h > θ, and 0 if h ≤ θ.
Note that the weights wi can be positive or negative. This corresponds to excitatory and
inhibitory connections that make neurons more likely to fire and less likely to fire, respectively.
Both of these types of synapses do exist within the brain, but with the McCulloch and Pitts
neurons, the weights can change from positive to negative or vice versa, which has not been seen
biologically—synaptic connections are either excitatory or inhibitory, and never change from
one to the other. Additionally, real neurons can have synapses that link back to themselves in a
feedback loop, but we do not usually allow that possibility when we make networks of neurons.
Again, there are exceptions, but we won’t get into them. It is possible to improve the model to
include many of these features, but the picture is complicated enough already, and McCulloch
and Pitts neurons already provide a great deal of interesting behaviour that resembles the action
of the brain, such as the fact that networks of McCulloch and Pitts neurons can memorise
pictures and learn to represent functions and classify data.
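Here is a minimal sketch of the McCulloch and Pitts neuron described above, wired up with the worked example values from the text (x = (1, 0, 0.5), w = (1, −0.5, −1), θ = 0):

```python
# A sketch of the McCulloch and Pitts binary threshold neuron.
def mcculloch_pitts(x, w, theta):
    """Sum the weighted inputs and compare the total to the threshold."""
    h = sum(xi * wi for xi, wi in zip(x, w))   # h = sum_i w_i * x_i
    return 1 if h > theta else 0               # fire (1) or not (0)

# Worked example from the text: h = 1*1 + 0*(-0.5) + 0.5*(-1) = 0.5 > 0.
print(mcculloch_pitts([1, 0, 0.5], [1, -0.5, -1], 0))  # -> 1 (the neuron fires)
```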
Linear Discriminant Analysis is a statistical method of dimensionality reduction that provides
the highest possible discrimination among various classes. It is used in machine learning to find
the linear combination of features that can separate two or more classes of objects with the best
performance. It has been widely used in many applications, such as pattern recognition, image
retrieval, and speech recognition. The method is based on discriminant functions that are
estimated from a set of data called the training set. These discriminant functions are linear with
respect to the feature vector, and usually have the form
f(x) = wᵀx + b0,
where w represents the weight vector, x the feature vector, and b0 a threshold (bias).
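To make the discriminant function concrete, here is a small sketch that classifies points by the sign of f(x) = wᵀx + b0. The weight vector and bias below are hypothetical; in practice they would be estimated from the training set (for example with scikit-learn's LinearDiscriminantAnalysis).

```python
# A minimal sketch of a linear discriminant: the class is decided by the sign
# of f(x) = w^T x + b0. The weights here are hypothetical, not fitted values.
import numpy as np

w = np.array([2.0, -1.0])   # weight vector (hypothetical)
b0 = -0.5                   # threshold / bias (hypothetical)

def discriminant(x):
    return w @ x + b0       # f(x) = w^T x + b0

for x in [np.array([1.0, 0.5]), np.array([0.0, 2.0])]:
    label = 1 if discriminant(x) > 0 else 0
    print(x, "->", label)   # (1, 0.5) -> 1, (0, 2) -> 0
```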
The FIND-S Algorithm is used to find the maximally specific hypothesis: running Find-S gives a
single maximally specific hypothesis consistent with the given set of training examples.
FIND-S Algorithm:
Step 1: Initialize h to the most specific hypothesis in H, e.g. h0 = (ø, ø, ø, ø, ø, ø, ø).
Step 2: For each positive training instance x, and for each attribute constraint ai in h: if the
constraint is satisfied by x, do nothing; otherwise replace ai by the next more general constraint
that is satisfied by x.
Step 3: Output hypothesis h.
1. How many possible examples are there in the instance space?
Solution: 2 * 3 * 2 * 2 * 3 = 72
2. How many hypotheses can be expressed by the hypothesis language?
Solution: 4 * 5 * 4 * 4 * 5 = 1600
Semantically Distinct Hypothesis = ( 3 * 4 * 3 * 3 * 4 ) + 1 = 433
3. Apply the FIND-S algorithm by hand on the given training set. Consider the examples in the
specified order and write down your hypothesis each time after observing an example.
Step 1:
h0 = (ø, ø, ø, ø, ø)
Step 2:
X1 = (some, small, no, expensive, many) – No
h1 = (ø, ø, ø, ø, ø)
(The intermediate examples X2–X5 are not reproduced in these notes; processing the positive
ones generalizes the hypothesis step by step to)
h5 = (many, ?, no, ?, ?)
Step 3:
Final Maximally Specific Hypothesis is:
h5 = (many, ?, no, ?, ?)
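A sketch of FIND-S in Python follows. Since the full training table is not reproduced in these notes, the positive examples below are hypothetical, chosen only so that the run finishes at the same final hypothesis (many, ?, no, ?, ?).

```python
# A sketch of FIND-S for conjunctive hypotheses over attribute tuples.
def find_s(examples):
    h = None                                   # h0 = (0, 0, ..., 0): matches nothing
    for x, label in examples:
        if label != "Yes":
            continue                           # FIND-S ignores negative examples
        if h is None:
            h = list(x)                        # first positive: copy it exactly
        else:
            # Generalize each attribute that disagrees with the example to '?'.
            h = [hi if hi == xi else "?" for hi, xi in zip(h, x)]
    return h

train = [
    (("some", "small", "no", "expensive", "many"), "No"),   # X1 (from the notes)
    (("many", "big",   "no", "expensive", "one"),  "Yes"),  # hypothetical
    (("many", "small", "no", "cheap",     "many"), "Yes"),  # hypothetical
]
print(find_s(train))  # -> ['many', '?', 'no', '?', '?']
```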
Version Space and List-Then-Eliminate Algorithm
A hypothesis h is said to be consistent with a set of training examples D iff h(x) = c(x) for
each example ⟨x, c(x)⟩ in D.
For Example:
h1 = (?, ?, No, ?, Many) – Consistent Hypothesis as it is consistent with all the training
examples
h2 = (?, ?, No, ?, ?) – Inconsistent Hypothesis as it is inconsistent with first training example
Version Space
The version space VS(H,D) is the subset of the hypotheses from H consistent with all the
training examples in D.
List-Then-Eliminate algorithm:
1. VersionSpace ← a list containing every hypothesis in H.
2. For each training example ⟨x, c(x)⟩: remove from VersionSpace any hypothesis h for which
h(x) ≠ c(x).
3. Output the list of hypotheses in VersionSpace.
Example:
F1 → A, B
F2 → X, Y
Here F1 and F2 are two features (attributes) with two possible values for each feature or
attribute.
Instance Space: (A, X), (A, Y), (B, X), (B, Y) – 4 Examples
Hypothesis Space: (A, X), (A, Y), (A, ø), (A, ?), (B, X), (B, Y), (B, ø), (B, ?), (ø, X), (ø, Y),
(ø, ø), (ø, ?), (?, X), (?, Y), (?, ø), (?, ?) – 16 hypotheses
Semantically Distinct Hypotheses: (A, X), (A, Y), (A, ?), (B, X), (B, Y), (B, ?), (?, X), (?, Y),
(?, ?), (ø, ø) – 10
Initial Version Space (before seeing any training examples): all 10 semantically distinct
hypotheses listed above.
Training Instances:
F1 F2 Target
A X Yes
A Y Yes
Applying List-Then-Eliminate, every hypothesis that misclassifies one of these two positive
examples is removed, leaving the version space {(A, ?), (?, ?)}.
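The following sketch enumerates the ten semantically distinct hypotheses for this toy problem and applies List-Then-Eliminate to the two training instances above:

```python
# A sketch of List-Then-Eliminate on the two-feature example above.
from itertools import product

F1, F2 = ["A", "B"], ["X", "Y"]

def covers(h, x):
    """A conjunctive hypothesis covers x if each constraint is '?' or matches."""
    return all(hc == "?" or hc == xc for hc, xc in zip(h, x))

# Semantically distinct hypotheses: each attribute is a value or '?', plus (0, 0).
hypotheses = list(product(F1 + ["?"], F2 + ["?"])) + [("0", "0")]

train = [(("A", "X"), True), (("A", "Y"), True)]

version_space = [h for h in hypotheses
                 if all(covers(h, x) == label for x, label in train)]
print(version_space)  # -> [('A', '?'), ('?', '?')]
```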
The Candidate Elimination Algorithm is used to find the set of all consistent hypotheses, that is,
the version space.
Algorithm:
Initialize S to the set of maximally specific hypotheses and G to the set of maximally general
hypotheses in H. For each training example d:
If d is a positive example:
Remove from G any hypothesis inconsistent with d. For each hypothesis s in S that is not
consistent with d, remove s from S and add to S all minimal generalizations of s that are
consistent with d and have some member of G more general than them.
If d is a negative example:
Remove from S any hypothesis inconsistent with d. For each hypothesis g in G that is not
consistent with d, remove g from G and add to G all minimal specializations of g that are
consistent with d and have some member of S more specific than them.
Solution:
The first example is positive, the hypothesis at the specific boundary is inconsistent, hence we
extend the specific boundary, and the hypothesis at the generic boundary is consistent hence we
retain it.
The second example is positive; again the hypothesis at the specific boundary is inconsistent,
hence we extend the specific boundary, and the hypothesis at the generic boundary is consistent
hence we retain it.
The third example is negative, the hypothesis at the specific boundary is consistent, hence we
retain it, and hypothesis at the generic boundary is inconsistent hence we write all consistent
hypotheses by removing one “?” (question mark) at a time.
Second example of the Candidate Elimination Algorithm:
Solution:
The first example is negative, the hypothesis at the specific boundary is consistent, hence we
retain it, and the hypothesis at the generic boundary is inconsistent hence we write all consistent
hypotheses by removing one “?” at a time.
S1: (0, 0, 0)
G1: (Small, ?, ?), (?, Blue, ?), (?, ?, Triangle)
The second example is negative, the hypothesis at the specific boundary is consistent, hence we
retain it, and the hypothesis at the generic boundary is inconsistent hence we write all consistent
hypotheses by removing one “?” at a time.
S2: (0, 0, 0)
G2: (Small, Blue, ?), (Small, ?, Circle), (?, Blue, ?), (Big, ?, Triangle), (?, Blue, Triangle)
The third example is positive, the hypothesis at the specific boundary is inconsistent, hence we
extend the specific boundary, and the consistent hypothesis at the generic boundary is retained
and inconsistent hypotheses are removed from the generic boundary.
The fourth example is negative, the hypothesis at the specific boundary is consistent, hence we
retain it, and the hypothesis at the generic boundary is inconsistent hence we write all consistent
hypotheses by removing one “?” at a time.
The fifth example is positive, the hypothesis at the specific boundary is inconsistent, hence we
extend the specific boundary, and the consistent hypothesis at the generic boundary is retained
and inconsistent hypotheses are removed from the generic boundary.
Learned Version Space by Candidate Elimination Algorithm for given data set is:
S5 = G5 = (Small, ?, Circle); the specific and general boundaries have converged to a single
hypothesis.
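Below is a simplified sketch of the Candidate Elimination algorithm for conjunctive hypotheses. The training table for this example is not reproduced in the notes, so the dataset in the code is a hypothetical reconstruction chosen to be consistent with the boundary sets S2 and G2 and the final converged hypothesis shown above; treat it as illustrative, not as the notes' original data.

```python
# A simplified Candidate Elimination sketch for conjunctive hypotheses.
def covers(h, x):
    """True if hypothesis h classifies instance x as positive."""
    return all(hc == "?" or hc == xc for hc, xc in zip(h, x))

def more_general(h1, h2):
    """True if h1 is more general than or equal to h2."""
    return all(a == "?" or a == b for a, b in zip(h1, h2))

domains = [("Big", "Small"), ("Red", "Blue"), ("Circle", "Triangle")]
S = [None]                    # None stands for the all-0, most specific hypothesis
G = [("?", "?", "?")]         # the most general hypothesis

data = [                      # hypothetical reconstruction of the training set
    (("Big", "Red", "Circle"), False),
    (("Small", "Red", "Triangle"), False),
    (("Small", "Blue", "Circle"), True),
    (("Small", "Blue", "Triangle"), False),
    (("Small", "Red", "Circle"), True),
]

for x, positive in data:
    if positive:
        G = [g for g in G if covers(g, x)]          # drop inconsistent general hyps
        new_S = []
        for s in S:
            if s is None:
                new_S.append(tuple(x))              # minimal generalization of all-0
            elif covers(s, x):
                new_S.append(s)
            else:                                   # generalize mismatches to '?'
                new_S.append(tuple(si if si == xi else "?"
                                   for si, xi in zip(s, x)))
        S = [s for s in new_S if any(more_general(g, s) for g in G)]
    else:
        S = [s for s in S if s is None or not covers(s, x)]
        new_G = []
        for g in G:
            if not covers(g, x):
                new_G.append(g)                     # already consistent: retain
                continue
            for i, dom in enumerate(domains):       # minimal specializations of g
                if g[i] != "?":
                    continue
                for v in dom:
                    if v != x[i]:
                        h = g[:i] + (v,) + g[i + 1:]
                        if any(s is None or more_general(h, s) for s in S):
                            new_G.append(h)
        G = new_G

print("S:", S)   # -> S: [('Small', '?', 'Circle')]
print("G:", G)   # -> G: [('Small', '?', 'Circle')]
```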
Linear Regression
In machine learning, linear regression is a supervised learning technique that models a linear
relationship between a dependent variable and one or more independent variables.
The sloped straight line representing the linear relationship that best fits the given data is
called the regression line.
It is also called the best fit line.
In simple linear regression, the dependent variable depends only on a single independent variable.
Y = β0 + β1X
Here,
Y is a dependent variable.
X is an independent variable.
β0 and β1 are the regression coefficients.
β0 is the intercept or the bias that fixes the offset to a line.
β1 is the slope or weight that specifies the factor by which X has an impact on Y.
Case-01: β1 < 0
It indicates that variable X has a negative impact on Y.
If X increases, Y will decrease, and vice-versa.
Case-02: β1 = 0
It indicates that variable X has no impact on Y.
If X changes, there will be no change in Y.
Case-03: β1 > 0
It indicates that variable X has a positive impact on Y.
If X increases, Y will increase, and vice-versa.
In multiple linear regression, the dependent variable depends on more than one independent
variable.
For multiple linear regression, the form of the model is
Y = β0 + β1X1 + β2X2 + …. + βnXn
Here,
Y is a dependent variable.
X1, X2, …., Xn are independent variables.
β0, β1,…, βn are the regression coefficients.
βj (1 ≤ j ≤ n) is the slope or weight that specifies the factor by which Xj has an impact on Y.
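As a quick illustration, the sketch below fits a simple linear regression with the closed-form least-squares estimates (β1 = cov(X, Y)/var(X), β0 = mean(Y) − β1 · mean(X)) and then fits the multiple-regression form with numpy's least-squares solver. The data is synthetic, generated from hypothetical true coefficients β0 = 1 and β1 = 2.

```python
# A minimal sketch of fitting a linear regression by ordinary least squares.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, 50)
Y = 1.0 + 2.0 * X + rng.normal(0, 0.5, 50)   # true beta0 = 1, beta1 = 2

# Simple linear regression: beta1 = cov(X, Y) / var(X), beta0 = mean(Y) - beta1 * mean(X).
beta1 = np.cov(X, Y, bias=True)[0, 1] / np.var(X)
beta0 = Y.mean() - beta1 * X.mean()
print(beta0, beta1)   # close to 1 and 2

# Multiple linear regression Y = beta0 + beta1*X1 + ... + betan*Xn via least squares;
# here there is one regressor, so the design matrix is just [1, X].
A = np.column_stack([np.ones_like(X), X])    # column of ones carries the intercept
coeffs, *_ = np.linalg.lstsq(A, Y, rcond=None)
print(coeffs)         # [beta0, beta1], the same fit as above
```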