ML Unit 1
• Machine Learning is the use and development of computer systems that are able to learn and adapt
without following explicit instructions, by using algorithms and statistical models to analyse and
draw inferences from patterns in data.
• It is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods
that leverage data to improve performance on some set of tasks.
• Learning algorithms work on the basis that strategies, algorithms, and inferences that worked well
in the past are likely to continue working well in the future.
• Machine learning programs can perform tasks without being explicitly programmed to do so. It
involves computers learning from data provided so that they carry out certain tasks. For simple
tasks assigned to computers, it is possible to program algorithms telling the machine how to execute
all steps required to solve the problem at hand; on the computer's part, no learning is needed.
• The discipline of machine learning employs various approaches to teach computers to accomplish
tasks where no fully satisfactory algorithm is available. In cases where vast numbers of potential
answers exist, one approach is to label some of the correct answers as valid.
• We may not be able to identify the process completely, but we believe we can construct a good
and useful approximation. That approximation may not explain everything, but may still be able to
account for some part of the data.
• We believe that though identifying the complete process may not be possible, we can still detect
certain patterns or regularities. This is the niche of machine learning. Such patterns may help us
understand the process, or we can use those patterns to make predictions.
• Application of machine learning methods to large databases is called data mining. In data mining,
a large volume of data is processed to construct a simple model with valuable use.
• Machine learning is not just a database problem; it is also a part of artificial intelligence. To be
intelligent, a system that is in a changing environment should have the ability to learn. If the system
can learn and adapt to such changes, the system designer need not foresee and provide solutions
for all possible situations.
• Machine learning uses the theory of statistics in building mathematical models, because the core
task is making inference from a sample.
• One of the applications of machine learning is basket analysis, which is finding associations
between products bought by customers:
• If people who buy X typically also buy Y, and if there is a customer who buys X and does not buy
Y, he or she is a potential Y customer.
• Once we find such customers, we can target them for cross-selling.
• In finding an association rule, we are interested in learning a conditional probability of the form
P(Y|X) where Y is the product we would like to condition on X, which is the product or the set of
products which we know that the customer has already purchased.
Let us say, going over our data, we calculate that P(chips | beer) = 0.7.
Then, we can define the rule: 70 percent of customers who buy beer also buy chips.
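As a minimal sketch of how such a conditional probability could be estimated from transaction data (the basket contents below are made-up illustration data, not real purchase records):

    # Estimate P(chips | beer) from a list of market baskets (illustrative data).
    transactions = [
        {"beer", "chips"},
        {"beer", "chips", "milk"},
        {"beer", "bread"},
        {"chips", "milk"},
        {"beer", "chips", "bread"},
    ]

    beer_baskets = [t for t in transactions if "beer" in t]
    both = [t for t in beer_baskets if "chips" in t]

    confidence = len(both) / len(beer_baskets)    # P(chips | beer)
    print(f"P(chips | beer) = {confidence:.2f}")  # 3/4 = 0.75 for this toy data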
1.2.2. Classification
• It is an important job for the bank to be able to predict in advance the risk associated with a loan,
which is the probability that the customer will default and not pay the whole amount back.
• This is both to make sure that the bank will make a profit and also to not inconvenience a customer
with a loan over his or her financial capacity.
• In credit scoring, the bank calculates the risk given the amount of credit and the information about
the customer.
• The information about the customer includes data we have access to and is relevant in calculating
his or her financial capacity—namely, income, savings, collaterals, profession, age, past financial
history, and so forth.
• The bank has a record of past loans containing such customer data and whether the loan was paid
back or not. From this data of particular applications, the aim is to infer a general rule coding the
association between a customer’s attributes and his risk.
• That is, the machine learning system fits a model to the past data to be able to calculate the risk for
a new application and then decides to accept or refuse it accordingly.
• This is an example of a classification problem where there are two classes: low-risk and high-risk
customers. The information about a customer makes up the input to the classifier whose task is to
assign the input to one of the two classes.
• After training with the past data, a classification rule learned may be of the form:
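IF income > θ1 AND savings > θ2 THEN low-risk, ELSE high-risk
(the attributes income and savings and the thresholds θ1, θ2 above are an illustrative choice; the actual attributes and threshold values are learned from the bank's past loan data).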
1.2.3 Regression
• First, regression analysis is widely used for prediction and forecasting, where its use has substantial
overlap with the field of machine learning. Second, in some situations regression analysis can be
used to infer causal relationships between the independent and dependent variables.
• In statistical modelling, regression analysis is a set of statistical processes for estimating the
relationships between a dependent variable (often called the 'outcome' or 'response' variable, or a
'label' in machine learning parlance) and one or more independent variables (often called
'predictors', 'covariates', 'explanatory variables' or 'features').
• Let us say we want to have a system that can predict the price of a used car. Inputs are the car
attributes—brand, year, engine capacity, mileage, and other information—that we believe affect a
car’s worth.
• The output is the price of the car. Such problems where the output is a number are regression
problems.
• Let X denote the car attributes, Y be the price of the car.
• On surveying the past transactions, we can collect a training data and the machine learning program
fits a function to this data to learn Y as a function of X.
• The above diagram depicts a training dataset of used cars and the function fitted. For simplicity,
the mileage is taken as the only input attribute and a linear model is used.
• Here the fitted function is of the form
y = wx + w0
• Both regression and classification are supervised learning problems where there is an input, X, an
output, Y, and the task is to learn the mapping from the input to the output.
• The approach in machine learning is that we assume a model defined up to a set of parameters:
y = g (x |θ)
where g(x) is the model and θ are its parameters. Y is a number in regression and is a class code
(e.g., 0/1) in the case of classification.
• The machine learning program optimizes the parameters, θ, such that the approximation error is
minimized, that is, our estimates are as close as possible to the correct values given in the training
set.
• In the above diagram, the model is linear and w and w0 are the parameters optimized for best fit to
the training data.
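A small sketch of fitting such a linear model by least squares (the mileage and price values below are made-up illustration data, not an actual survey of past transactions):

    import numpy as np

    # Illustrative training data: x = mileage (thousands of km), y = price (thousands).
    x = np.array([20.0, 45.0, 60.0, 85.0, 120.0])
    y = np.array([18.5, 15.0, 13.2, 10.1, 7.4])

    # Fit y = w*x + w0 by least squares; np.polyfit returns [w, w0] for degree 1.
    w, w0 = np.polyfit(x, y, deg=1)
    print(f"fitted model: y = {w:.3f} * x + {w0:.3f}")

    # Predict the price of a car with 70 (thousand) km on it.
    print("predicted price:", w * 70.0 + w0)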
• In supervised learning, the aim is to learn a mapping from the input to an output whose correct
values are provided by a supervisor. In unsupervised learning, there is no such supervisor and we
only have input data.
• The aim is to find the regularities in the input. There is a structure to the input space such that
certain patterns occur more often than others, and we want to see what generally happens and what
does not. In statistics, this is called density estimation.
• One method for density estimation is clustering where the aim is to find clusters or groupings of
input.
• Some of the applications are customer segmentation, image compression, document clustering,
bioinformatics.
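As a sketch of one such application, customer segmentation by clustering might look like the following (scikit-learn's KMeans is used here; the two customer features and their values are invented for illustration):

    import numpy as np
    from sklearn.cluster import KMeans

    # Illustrative customer data: columns = (annual spend, number of store visits).
    X = np.array([[200, 5], [220, 6], [800, 30], [790, 28], [210, 4], [805, 33]])

    # Group the customers into two clusters; each cluster is one customer segment.
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print("cluster labels:", kmeans.labels_)
    print("cluster centres:", kmeans.cluster_centers_)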
• Reinforcement Learning enables an agent to learn in an interactive environment by trial and error
using feedback from its own actions and experiences.
• Though both supervised and reinforcement learning use a mapping between input and output, unlike
supervised learning, where the feedback provided to the agent is the correct set of actions for
performing a task, reinforcement learning uses rewards and punishments as signals for positive and
negative behaviour.
• A good example is game playing. A robot navigating in an environment in search of a goal location
is another application area of reinforcement learning.
• One factor that makes reinforcement learning harder is when the system has unreliable and partial
sensory information.
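As a sketch of the reward-driven idea, one common formulation is the tabular Q-learning update, in which the agent adjusts its value estimate for a (state, action) pair using the reward it just received (the state and action names, learning rate, and discount factor below are arbitrary illustrations):

    import collections

    Q = collections.defaultdict(float)   # Q[(state, action)] -> estimated value
    alpha, gamma = 0.1, 0.9              # learning rate and discount factor

    def q_update(state, action, reward, next_state, actions):
        # Move the estimate towards reward + discounted value of the best next action.
        best_next = max(Q[(next_state, a)] for a in actions)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

    # Example: the agent moved right from cell "s1" to "s2" and received reward +1.
    q_update("s1", "right", 1.0, "s2", actions=["left", "right", "up", "down"])
    print(Q[("s1", "right")])            # 0.1 * (1.0 + 0.9 * 0 - 0) = 0.1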
• Each car is represented by an ordered pair (x, r), where the label r denotes its type: r = 1 if x is a
positive example (a family car) and r = 0 if it is a negative example.
• The training set contains N such examples,
X = {x^t, r^t}, t = 1, . . ., N
where t indexes different examples in the set; it does not represent time or any such order.
• This figure explains the hypothesis class. The class of family car is a rectangle in the price-engine
power space.
• Our training data can now be plotted in the two-dimensional (x1, x2) space, where each instance t is
a data point at coordinates (x1^t, x2^t) and its type, namely, positive versus negative, is given by r^t.
• After further discussions with the expert and the analysis of the data, we may have reason to believe
that for a car to be a family car, its price and engine power should be in a certain range. That is,
(p1 ≤ price ≤ p2) AND (e1 ≤ engine power ≤ e2) for suitable values of p1, p2, e1, and e2.
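A sketch of such a rectangle hypothesis written as a function (the threshold values p1, p2, e1, e2 used in the example calls are arbitrary placeholders; in practice they would be chosen from the training data):

    def h(price, engine_power, p1, p2, e1, e2):
        # Predict 1 (family car) if the instance falls inside the axis-aligned
        # rectangle given by the price range [p1, p2] and power range [e1, e2].
        return 1 if (p1 <= price <= p2) and (e1 <= engine_power <= e2) else 0

    # Illustrative use with placeholder thresholds.
    print(h(price=18000, engine_power=110, p1=15000, p2=25000, e1=90, e2=150))  # 1
    print(h(price=60000, engine_power=300, p1=15000, p2=25000, e1=90, e2=150))  # 0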
• The empirical error is the proportion of training instances where the predictions of h do not match the
required values given in X.
• The error of hypothesis h given the training set X is
E(h | X) = Σ_{t=1..N} 1(h(x^t) ≠ r^t)
where 1(h(x^t) ≠ r^t) is 1 if the prediction and the required value disagree and 0 otherwise.
• Generalization is explained as how well our hypothesis will correctly classify future examples that
are not part of the training set.
• One possibility is to find the most specific hypothesis, S, that is the tightest rectangle that includes
all the positive examples and none of the negative examples.
• The most general hypothesis, G, is the largest rectangle we can draw that includes all the positive
examples and none of the negative examples.
• Here, C is the actual class and h is our induced hypothesis. The point where C is 1 but h is 0 is a
false negative, and the point where C is 0 but h is 1 is a false positive. Other points—namely, true
positives and true negatives—are correctly classified.
• Given X, we can find S, or G, or any h from the version space and use it as our hypothesis, h. It
seems intuitive to choose h halfway between S and G; this is to increase the margin, which is the
distance between the boundary and the instances closest to it.
• We choose the hypothesis with the largest margin, for best separation. The shaded instances are
those that define (or support) the margin; other instances can be removed without affecting h.
1.3.2. Vapnik-Chervonenkis (VC) Dimension
• The maximum number of points that can be shattered by H is called the Vapnik-Chervonenkis
(VC) dimension of H, is denoted as VC(H), and measures the capacity of H. Let us say we have a
dataset containing N points.
• These N points can be labelled in 2^N ways as positive and negative. Therefore, 2^N different
learning problems can be defined by N data points.
• If for any of these problems, we can find a hypothesis h ∈ H that separates the positive examples
from the negative, then we say H shatters N points.
• We take two points, m and n. For these two points, there are 2^2 = 4 distinct labellings in binary
classification.
• We list these cases as follows:
• We can observe that, for all the possible labellings of m and n, the model can separate the two
points into the two classes.
• This is where we can claim that our model successfully shatters two points of the dataset.
Consequently, the VC dimension of this model is 2 (for now).
• Similar to the testing above, the model also works on three points, which bumps our VC dimension
up to 3; three points give 2^3 = 8 possible labellings.
• Specifically, in cases like these an axis-aligned rectangle can shatter four points in two dimensions.
Then VC(H), when H is the hypothesis class of axis-aligned rectangles in two dimensions, is four.
• In calculating the VC dimension, it is enough that we find four points that can be shattered; it is
not necessary that we be able to shatter any four points in two dimensions
• For example, four points placed on a line cannot be shattered by rectangles.
• However, we cannot place five points in two dimensions anywhere such that a rectangle can
separate the positive and negative examples for all possible labellings.
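A brute-force sketch of the shattering check for axis-aligned rectangles: for a given point set, try every labelling and test whether some rectangle puts the positives inside and the negatives outside (the point coordinates are chosen only for illustration; a diamond of four points can be shattered, four collinear points cannot):

    from itertools import product

    def shatterable_by_rectangles(points):
        # A labelling is realizable by an axis-aligned rectangle exactly when the
        # bounding box of the positive points contains no negative point.
        for labels in product([0, 1], repeat=len(points)):
            pos = [p for p, lab in zip(points, labels) if lab == 1]
            neg = [p for p, lab in zip(points, labels) if lab == 0]
            if not pos:
                continue  # the all-negative labelling is realized by an empty rectangle
            x_lo, x_hi = min(p[0] for p in pos), max(p[0] for p in pos)
            y_lo, y_hi = min(p[1] for p in pos), max(p[1] for p in pos)
            if any(x_lo <= x <= x_hi and y_lo <= y <= y_hi for x, y in neg):
                return False  # no rectangle can realize this labelling
        return True

    print(shatterable_by_rectangles([(0, 1), (1, 0), (0, -1), (-1, 0)]))  # True: shattered
    print(shatterable_by_rectangles([(0, 0), (1, 0), (2, 0), (3, 0)]))    # False: collinear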
1.3.3. Probably Approximately Correct (PAC) Learning
• For a finite hypothesis class H, a consistent learner is probably approximately correct (error at most
ϵ with probability at least 1 − δ) when the number of training examples m satisfies
m ≥ (1/ϵ) (ln |H| + ln (1/δ))
• In this example, the analyser should make sure that the probability of a positive example falling
in this region (and causing an error) is at most ϵ.
• For any of these strips, if we can guarantee that the probability is upper bounded by ϵ /4, the
error is at most 4(ϵ /4) = ϵ.
• In this problem we count the overlaps in the corners twice, and the total actual error in this case is
less than 4(ϵ/4).
• The probability that a randomly drawn example misses this strip is 1 − ϵ /4.
• The probability that all N independent draws miss the strip is (1 − ϵ/4)^N, and the probability that
all N independent draws miss any of the four strips is at most 4(1 − ϵ/4)^N, which we would like
to be at most δ.
• So we choose N and δ such that 4(1 − ϵ/4)^N ≤ δ
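Using the inequality (1 − x) ≤ e^(−x), the condition 4(1 − ϵ/4)^N ≤ δ holds whenever N ≥ (4/ϵ) ln(4/δ). A small sketch that evaluates this bound for one choice of ϵ and δ (the particular values 0.1 and 0.05 are arbitrary illustrations):

    import math

    def pac_sample_size(epsilon, delta):
        # A sample size sufficient for 4 * (1 - epsilon/4)**N <= delta,
        # obtained from (1 - x) <= exp(-x): N >= (4/epsilon) * ln(4/delta).
        return math.ceil((4.0 / epsilon) * math.log(4.0 / delta))

    # Example: error at most 0.1 with probability at least 0.95 (delta = 0.05).
    N = pac_sample_size(epsilon=0.1, delta=0.05)
    print(N)                               # 176
    print(4 * (1 - 0.1 / 4) ** N <= 0.05)  # True: the original condition holds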
1.3.4. Noise
• Noise is any unwanted anomaly in the data and due to noise, the class may be more difficult to
learn and zero error may be infeasible with a simple model (hypothesis class).
• Humans are prone to making mistakes when collecting data, and data collection instruments may
be unreliable, resulting in dataset errors. The errors are referred to as noise.
• Data noise in machine learning can cause problems since the algorithm interprets the noise as a
pattern and can start generalizing from it.
• In our example of learning a family car, we have positive examples belonging to the class family
car and the negative examples belonging to all other cars. This is a two-class problem.
• In the above diagram, there are three classes: family car, sports car, and luxury sedan. There are
three hypotheses induced, each one covering the instances of one class and leaving outside the
instances of the other two classes. ‘?’ are reject regions where no, or more than one, class is chosen.
• This is where learning multiple classes comes into the picture.
• In the general case, we have K classes, denoted as Ci, i = 1, . . ., K, and an input instance belongs
to one and exactly one of them.
• The training set is now of the form
X = {x^t, r^t}, t = 1, . . ., N
where r^t has K dimensions and r_i^t = 1 if x^t belongs to class Ci, and r_i^t = 0 otherwise.
• For a given x, ideally only one of hi(x), i = 1, . . ., K is 1 and we can choose a class. But when no,
or two or more, hi(x) is 1, we cannot choose a class, and this is the case of doubt and the classifier
rejects such cases.
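A sketch of this scheme with a reject option, using one hypothesis per class; the three hypothesis functions below are placeholder rectangle-style checks over (price, engine power), with made-up thresholds purely for illustration:

    def classify(x, hypotheses):
        # hypotheses: a list of K functions h_i(x), each returning 1 or 0.
        votes = [i for i, h in enumerate(hypotheses) if h(x) == 1]
        if len(votes) == 1:
            return votes[0]   # exactly one class says "yes": choose that class
        return "reject"       # zero or several classes say "yes": case of doubt

    # Illustrative hypotheses over (price, engine power): one rectangle per class.
    h_family = lambda x: 1 if 15000 <= x[0] <= 25000 and 90 <= x[1] <= 150 else 0
    h_sports = lambda x: 1 if 30000 <= x[0] <= 60000 and 200 <= x[1] <= 400 else 0
    h_luxury = lambda x: 1 if 60000 <= x[0] <= 120000 and 150 <= x[1] <= 350 else 0

    print(classify((18000, 110), [h_family, h_sports, h_luxury]))  # 0 (family car)
    print(classify((60000, 300), [h_family, h_sports, h_luxury]))  # 'reject': two classes claim it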
• In our example of learning a family car, we used only one hypothesis and only modelled the
positive examples. Any negative example outside is not a family car.
• Alternatively, sometimes we may prefer to build two hypotheses, one for the positive and the other
for the negative instances.
• This assumes a structure also for the negative instances that can be covered by another hypothesis.
• Separating family cars from sports cars is such a problem; each class has a structure of its own.
• The advantage is that if the input is a luxury sedan, we can have both hypotheses decide negative
and reject the input.
• If in a dataset, we expect to have all classes with similar distribution— shapes in the input space—
then the same hypothesis class can be used for all classes.
• For example, in a handwritten digit recognition dataset, we would expect all digits to have similar
distributions.
• But in a medical diagnosis dataset, for example, where we have two classes for sick and healthy
people, we may have completely different distributions for the two classes; there may be multiple
ways for a person to be sick, reflected differently in the inputs: All healthy people are alike; each
sick person is sick in his or her own way.
1.5. Regression
• In classification, given an input, the output that is generated is Boolean; it is a yes/no answer.
• When the output is a numeric value, what we would like to learn is not a class, C(x) ∈ {0, 1}, but
is a numeric function.
• In machine learning, the function is not known but we have a training set of examples drawn from
it.
• X = {x^t, r^t}, t = 1, . . ., N, where r^t ∈ R.
• If there is no noise, the task is interpolation. We would like to find the function f(x) that passes
through these points such that we have
r^t = f (x^t)
• In polynomial interpolation, given N points, we find the (N−1)st degree polynomial that passes
through them and can be used to predict the output for any x. Predicting the output for an x that lies
outside the range of the training points is called extrapolation.
• In regression, there is noise added to the output of the unknown function.
• Let us start with the case of learning a Boolean function from examples.
• In a Boolean function, all inputs and the output are binary.
• There are 2^d possible ways to write d binary values and therefore, with d inputs, the training set
has at most 2^d examples.
• In the above table, there are two inputs, so there are four possible cases and sixteen possible Boolean
functions.
• As shown in the table, each of these cases can be labeled as 0 or 1, and therefore there are 2^(2^d)
possible Boolean functions of d inputs.
• Each distinct training example removes half the hypotheses, namely, those whose guesses are
wrong.
• For example, let us say we have x1 = 0, x2 = 1 and the output is 0; this removes h5, h6, h7, h8, h13,
h14, h15, h16.
• This is one way to interpret learning: we start with all possible hypothesis and as we see more
training examples, we remove those hypotheses that are not consistent with the training data.
• In the case of a Boolean function, to end up with a single hypothesis we need to see all 2^d training
examples.
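A small sketch of this elimination view of learning for d = 2 inputs: enumerate all sixteen Boolean functions and drop the ones inconsistent with each observed example (the particular training examples are illustrative):

    from itertools import product

    inputs = list(product([0, 1], repeat=2))   # the 4 possible (x1, x2) cases
    # Each hypothesis assigns an output bit to every case: 2**4 = 16 hypotheses.
    hypotheses = [dict(zip(inputs, bits)) for bits in product([0, 1], repeat=4)]

    # Observing (x1=0, x2=1) -> 0 removes half of the 16 hypotheses.
    examples = [((0, 1), 0)]
    consistent = [h for h in hypotheses if all(h[x] == r for x, r in examples)]
    print(len(consistent))                     # 8

    # A second example halves the remaining set again.
    examples.append(((1, 1), 1))
    consistent = [h for h in hypotheses if all(h[x] == r for x, r in examples)]
    print(len(consistent))                     # 4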
• If the training set we are given contains only a small subset of all possible instances, as it generally
does (that is, if we know what the output should be for only a small percentage of the cases), the
solution is not unique.
• After seeing N example cases, there remain 2^(2^d − N) possible functions. This is an example of an ill-
posed problem, where the data by itself is not sufficient to find a unique solution.
• So, because learning is ill-posed, and data by itself is not sufficient to find the solution, we should
make some extra assumptions to have a unique solution with the data we have. The set of
assumptions we make to have learning possible is called the inductive bias of the learning
algorithm.
• Thus, learning is not possible without inductive bias, and now the question is how to choose the
right bias. This is called model selection, which is choosing between possible H.
• In answering this question, we should remember that the aim of machine learning is rarely to
replicate the training data but the prediction for new cases.
• That is, we would like to be able to generate the right output for an input instance outside the training
set, one for which the correct output is not given in the training set.
• How well a model trained on the training set predicts the right output for new instances is called
generalization.
• For best generalization, we should match the complexity of the hypothesis class H with the
complexity of the function underlying the data. If H is less complex than the function, we have
underfitting.
• If there is noise, an overcomplex hypothesis may learn not only the underlying function but also
the noise in the data and may make a bad fit. This is called overfitting.