ML Unit 1

1.1. What Is Machine Learning

• Machine Learning is the use and development of computer systems that are able to learn and adapt
without following explicit instructions, by using algorithms and statistical models to analyse and
draw inferences from patterns in data.
• It is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods
that leverage data to improve performance on some set of tasks.
• Learning algorithms work on the basis that strategies, algorithms, and inferences that worked well
in the past are likely to continue working well in the future.
• Machine learning programs can perform tasks without being explicitly programmed to do so. It
involves computers learning from data provided so that they carry out certain tasks. For simple
tasks assigned to computers, it is possible to program algorithms telling the machine how to execute
all steps required to solve the problem at hand; on the computer's part, no learning is needed.
• The discipline of machine learning employs various approaches to teach computers to accomplish
tasks where no fully satisfactory algorithm is available. In cases where vast numbers of potential
answers exist, one approach is to label some of the correct answers as valid.
• We may not be able to identify the process completely, but we believe we can construct a good
and useful approximation. That approximation may not explain everything, but may still be able to
account for some part of the data.
• We believe that though identifying the complete process may not be possible, we can still detect
certain patterns or regularities. This is the niche of machine learning. Such patterns may help us
understand the process, or we can use those patterns to make predictions.
• Application of machine learning methods to large databases is called data mining. In data mining,
a large volume of data is processed to construct a simple model with valuable use.
• Machine learning is not just a database problem; it is also a part of artificial intelligence. To be
intelligent, a system that is in a changing environment should have the ability to learn. If the system
can learn and adapt to such changes, the system designer need not foresee and provide solutions
for all possible situations.
• Machine learning uses the theory of statistics in building mathematical models, because the core
task is making inference from a sample.

1.2. Examples of Machine Learning Applications

1.2.1. Learning Associations

• One of the applications of machine learning is basket analysis, which is finding associations
between products bought by customers:
• If people who buy X typically also buy Y, and if there is a customer who buys X and does not buy
Y, he or she is a potential Y customer.
• Once we find such customers, we can target them for cross-selling.
• In finding an association rule, we are interested in learning a conditional probability of the form
P(Y|X) where Y is the product we would like to condition on X, which is the product or the set of
products which we know that the customer has already purchased.

Let us say, going over our data, we calculate that P(chips | beer) = 0.7.

Then, we can define the rule: 70 percent of customers who buy beer also buy chips.
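As an illustration of how such a conditional probability can be estimated from transaction data, here is a minimal Python sketch; the baskets below are made up for the example.

```python
# A minimal sketch of estimating P(Y | X) from basket data (the transactions are made up).
transactions = [
    {"beer", "chips"},
    {"beer", "chips", "salsa"},
    {"beer"},
    {"milk", "bread"},
    {"beer", "chips"},
]

def confidence(x, y, transactions):
    """Estimate P(y | x): among baskets containing x, the fraction that also contain y."""
    with_x = [t for t in transactions if x in t]
    if not with_x:
        return 0.0
    return sum(1 for t in with_x if y in t) / len(with_x)

print(confidence("beer", "chips", transactions))  # 3 of the 4 beer baskets contain chips -> 0.75
```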

1.2.2. Classification

• It is an important job for the bank to be able to predict in advance the risk associated with a loan,
which is the probability that the customer will default and not pay the whole amount back.
• This is both to make sure that the bank will make a profit and also to not inconvenience a customer
with a loan over his or her financial capacity.
• In credit scoring, the bank calculates the risk given the amount of credit and the information about
the customer.
• The information about the customer includes data we have access to and is relevant in calculating
his or her financial capacity—namely, income, savings, collaterals, profession, age, past financial
history, and so forth.
• The bank has a record of past loans containing such customer data and whether the loan was paid
back or not. From this data of particular applications, the aim is to infer a general rule coding the
association between a customer’s attributes and his risk.
• That is, the machine learning system fits a model to the past data to be able to calculate the risk for
a new application and then decides to accept or refuse it accordingly.
• This is an example of a classification problem where there are two classes: low-risk and high-risk
customers. The information about a customer makes up the input to the classifier whose task is to
assign the input to one of the two classes.
• After training with the past data, a classification rule learned may be of the form:

IF income > θ1 AND savings > θ2 THEN low-risk ELSE high-risk
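As a quick illustration, the sketch below applies such a rule to new applicants; the threshold values θ1 and θ2 are hypothetical placeholders, not values learned from real data.

```python
# A sketch of the IF-THEN discriminant above; the thresholds are hypothetical, not learned.
THETA_1 = 30_000   # income threshold (assumed)
THETA_2 = 10_000   # savings threshold (assumed)

def classify(income, savings):
    """Return 'low-risk' or 'high-risk' according to the rule."""
    if income > THETA_1 and savings > THETA_2:
        return "low-risk"
    return "high-risk"

print(classify(income=45_000, savings=20_000))  # low-risk
print(classify(income=45_000, savings=5_000))   # high-risk
```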

• This IF-THEN rule is an example of a discriminant. A discriminant is a function that separates the
examples of different classes.
• Having a rule like this, the main application is prediction: Once we have a rule that fits the past
data, if the future is similar to the past, then we can make correct predictions for novel instances.
• In some cases, instead of making a 0/1 (low-risk/high-risk) type decision, we may want to calculate
a probability, namely, P(Y|X), where X are the customer attributes and Y is 0 or 1 respectively for
low-risk and high-risk.
• There are many applications of machine learning in pattern recognition. One is optical
character recognition, which is recognizing character codes from their images. This is an example
where there are multiple classes, as many as there are characters we would like to recognize.
Especially interesting is the case when the characters are handwritten—for example, to read zip
codes on envelopes or amounts on checks.
• If we are reading a text, one factor we can make use of is the redundancy in human languages. A
word is a sequence of characters and successive characters are not independent but are constrained
by the words of the language.
• Such contextual dependencies may also occur in higher levels, between words and sentences,
through the syntax and semantics of the language. There are machine learning algorithms to learn
sequences and model such dependencies.
• In the case of face recognition, the input is an image, the classes are people to be recognized, and
the learning program should learn to associate the face images to identities.
• This problem is more difficult than optical character recognition because there are more classes,
input image is larger, and a face is three-dimensional and differences in pose and lighting cause
significant changes in the image.
• There may also be occlusion of certain inputs; for example, glasses may hide the eyes and
eyebrows, and a beard may hide the chin.
• In medical diagnosis, the inputs are the relevant information we have about the patient and the
classes are the illnesses. The inputs contain the patient’s age, gender, past medical history, and
current symptoms. Some tests may not have been applied to the patient, and thus these inputs would
be missing.
• In the case of a medical diagnosis, a wrong decision may lead to a wrong or no treatment, and in
cases of doubt it is preferable that the classifier reject and defer decision to a human expert.
• In speech recognition, the input is acoustic and the classes are words that can be uttered.
• This time the association to be learned is from an acoustic signal to a word of some language.
Different people, because of differences in age, gender, or accent, pronounce the same word
differently, which makes this task rather difficult.
• Another difference of speech is that the input is temporal; words are uttered in time as a sequence
of speech phonemes and some words are longer than others.
• Acoustic information only helps up to a certain point, and as in optical character recognition, the
integration of a “language model” is critical in speech recognition, and the best way to come up
with a language model is again by learning it from some large corpus of example data.
• The applications of machine learning to natural language processing are constantly increasing.
• Spam filtering is one, where spam generators on one side and filters on the other side keep finding
more and more ingenious ways to outdo each other. Perhaps the most impressive application is machine
translation.
• After decades of research on hand-coded translation rules, it has become apparent recently that the
most promising way is to provide a very large number of example pairs of translated texts and have
a program figure out automatically the rules to map one string of characters to another.
• Biometrics is recognition or authentication of people using their physiological and/or behavioural
characteristics that requires an integration of inputs from different modalities. Examples of
physiological characteristics are images of the face, fingerprint, iris, and palm; examples of
behavioural characteristics are dynamics of signature, voice, gait, and key stroke.
• Learning a rule from data also allows knowledge extraction. The rule is a simple model that
explains the data, and looking at this model we have an explanation about the process underlying
the data.
• For example, once we learn the discriminant separating low-risk and high-risk customers, we have
the knowledge of the properties of low-risk customers. We can then use this information to target
potential low-risk customers more efficiently, for example, through advertising.
• Learning also performs compression in that by fitting a rule to the data, we get an explanation that
is simpler than the data, requiring less memory to store and less computation to process.
• Another use of machine learning is outlier detection, which is finding the instances that do not obey
the rule and are exceptions. In this case, after learning the rule, we are not interested in the rule but
the exceptions not covered by the rule, which may imply anomalies requiring attention— for
example, fraud.

1.2.3 Regression

Regression analysis is primarily used for two conceptually distinct purposes.

• First, regression analysis is widely used for prediction and forecasting, where its use has substantial
overlap with the field of machine learning. Second, in some situations, regression analysis can be
used to infer causal relationships between the independent and dependent variables.
• In statistical modelling, regression analysis is a set of statistical processes for estimating the
relationships between a dependent variable (often called the 'outcome' or 'response' variable, or a
'label' in machine learning parlance) and one or more independent variables (often called
'predictors', 'covariates', 'explanatory variables' or 'features').
• Let us say we want to have a system that can predict the price of a used car. Inputs are the car
attributes—brand, year, engine capacity, mileage, and other information—that we believe affect a
car’s worth.
• The output is the price of the car. Such problems where the output is a number are regression
problems.
• Let X denote the car attributes, Y be the price of the car.
• On surveying the past transactions, we can collect training data, and the machine learning program
fits a function to this data to learn Y as a function of X.
• The above diagram depicts a training dataset of used cars and the function fitted. For simplicity,
the mileage is taken as the only input attribute and a linear model is used.
• Here the fitted function is of the form

y = wx + w0

• Both regression and classification are supervised learning problems where there is an input, X, an
output, Y, and the task is to learn the mapping from the input to the output.
• The approach in machine learning is that we assume a model defined up to a set of parameters:

y = g (x |θ)

where g(x) is the model and θ are its parameters. Y is a number in regression and is a class code
(e.g., 0/1) in the case of classification.

• The machine learning program optimizes the parameters, θ, such that the approximation error is
minimized, that is, our estimates are as close as possible to the correct values given in the training
set.
• In the above diagram, the model is linear and w and w0 are the parameters optimized for best fit to
the training data.
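Below is a minimal sketch of fitting the linear model y = wx + w0 by least squares; the mileage and price values are made-up numbers, not real transaction data.

```python
# Fit y = w*x + w0 to made-up used-car data: x = mileage (thousands of km), y = price.
import numpy as np

x = np.array([20, 50, 80, 120, 160], dtype=float)   # mileage
y = np.array([18, 15, 12, 9, 6], dtype=float)       # price (illustrative units)

# Least squares: choose w, w0 minimizing sum_t (y_t - (w*x_t + w0))^2
A = np.vstack([x, np.ones_like(x)]).T
(w, w0), *_ = np.linalg.lstsq(A, y, rcond=None)

print(f"w = {w:.4f}, w0 = {w0:.4f}")
print("predicted price at 100k km:", w * 100 + w0)
```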

1.2.4 Unsupervised Learning

• In supervised learning, the aim is to learn a mapping from the input to an output whose correct
values are provided by a supervisor. In unsupervised learning, there is no such supervisor and we
only have input data.
• The aim is to find the regularities in the input. There is a structure to the input space such that
certain patterns occur more often than others, and we want to see what generally happens and what
does not. In statistics, this is known as density estimation.
• One method for density estimation is clustering, where the aim is to find clusters or groupings of
the input (a small clustering sketch follows this list).
• Some of the applications are customer segmentation, image compression, document clustering,
bioinformatics.
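As an illustration of one clustering approach, here is a minimal k-means sketch on made-up two-dimensional points; k-means is only one of many possible clustering methods.

```python
# A minimal k-means sketch: alternate between assigning points to the nearest center
# and moving each center to the mean of its assigned points. Data and k are illustrative.
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign each point to its nearest center.
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        # Move each center to the mean of the points assigned to it.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
labels, centers = kmeans(X, k=2)
print(labels)   # two groups of three points each
print(centers)
```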

1.2.5. Reinforcement Learning

• Reinforcement Learning enables an agent to learn in an interactive environment by trial and error
using feedback from its own actions and experiences.
• Though both supervised and reinforcement learning use a mapping between input and output, in
supervised learning the feedback provided to the agent is the correct set of actions for performing
a task, whereas reinforcement learning uses rewards and punishments as signals for positive and
negative behaviour.
• A good example is game playing. A robot navigating in an environment in search of a goal location
is another application area of reinforcement learning.
• One factor that makes reinforcement learning harder is when the system has unreliable and partial
sensory information.
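A minimal sketch of one reinforcement learning algorithm, tabular Q-learning, on a made-up five-state corridor is shown below; the environment, rewards, and parameters are illustrative assumptions, not a standard benchmark.

```python
# Tabular Q-learning sketch: an agent on a 5-state corridor learns by trial and error
# to walk right toward a rewarding goal state. Everything here is illustrative.
import random

N_STATES, GOAL = 5, 4
ACTIONS = [-1, +1]                       # step left, step right
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, epsilon = 0.5, 0.9, 0.1    # learning rate, discount, exploration rate

for episode in range(200):
    s = 0
    while s != GOAL:
        if random.random() < epsilon:
            a = random.choice(ACTIONS)   # explore
        else:
            # Greedy action with random tie-breaking.
            a = max(ACTIONS, key=lambda b: (Q[(s, b)], random.random()))
        s_next = min(max(s + a, 0), N_STATES - 1)
        r = 1.0 if s_next == GOAL else 0.0          # reward only at the goal
        # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s_next, b)] for b in ACTIONS) - Q[(s, a)])
        s = s_next

# Typically the learned policy is to step right (+1) in every non-goal state.
print({s: max(ACTIONS, key=lambda b: Q[(s, b)]) for s in range(N_STATES - 1)})
```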

1.3. Supervised Learning

1.3.1. Learning a Class from Examples

• Let us say we want to learn the class, C, of a “family car.”


• People look at the cars and label them; the cars that they believe are family cars are positive
examples and the other cars are negative examples.
• Class learning is finding a description that is shared by all positive examples and none of the
negative examples.
• Doing this, the car dealer and the family can make a prediction: Given a car that we have not seen
before, by checking with the description learned, we will be able to say whether it is a family car
or not.
• Suppose we take the most dominant attributes that determine whether a car is a family car to be the
engine power and the price.
• These attributes are the inputs to the class recognizer.
• When we decide on this particular input representation, we are ignoring various other attributes as
irrelevant.
• Training set for the class of a “family car.” Each data point corresponds to one example car, and
the coordinates of the point indicate the price and engine power of that car. ‘+’ denotes a positive
example of the class (a family car), and ‘−’ denotes a negative example (not a family car); it is
another type of car.
• Let us denote price as the first input attribute x1 and engine power as the second attribute x2.
• Thus, we represent each car using two numeric values, x = [x1, x2], together with its label r, where
r = 1 if x is a positive example (a family car) and r = 0 if it is a negative example.
• Each car is represented by such an ordered pair (x, r) and the training set contains N such examples:

X = {x^t, r^t}, t = 1, . . . , N

• where t indexes different examples in the set; it does not represent time or any such order.
• This figure explains the hypothesis class. The class of family car is a rectangle in the price-engine
power space.
• Our training data can now be plotted in the two-dimensional (x1, x2) space where each instance t is
a data point at coordinates (x^t_1, x^t_2) and its type, namely, positive versus negative, is given by r^t.
• After further discussions with the expert and the analysis of the data, we may have reason to believe
that for a car to be a family car, its price and engine power should be in a certain range. That is,

(p1 ≤ price ≤ p2) AND (e1 ≤ engine power ≤ e2) for suitable values of p1, p2, e1, and e2.

• Thus, here we assume C to be a rectangle in the price-engine power space.


• The hypothesis class H is the set of hypotheses from which we believe the class C is drawn, namely, the set of axis-aligned rectangles.
• The learning algorithm then finds the particular hypothesis, h ∈H, to approximate the class C as
closely as possible.
• The aim is to find h ∈ H that is as similar as possible to C. Let us say the hypothesis h makes a
prediction for an instance x such that

h(x) = 1 if h classifies x as a positive example, and h(x) = 0 if h classifies x as a negative example.

• The empirical error is the proportion of training instances where the predictions of h do not match
the required values given in X.
• The error of hypothesis h given the training set X is

E(h | X) = (1/N) Σ_{t=1}^{N} 1(h(x^t) ≠ r^t)

(a code sketch at the end of this subsection computes this error for a rectangle hypothesis).

• Generalization refers to how well our hypothesis will correctly classify future examples that
are not part of the training set.
• One possibility is to find the most specific hypothesis, S, that is the tightest rectangle that includes
all the positive examples and none of the negative examples.
• The most general hypothesis, G, is the largest rectangle we can draw that includes all the positive
examples and none of the negative examples.
• Here, C is the actual class and h is our induced hypothesis. The point where C is 1 but h is 0 is a
false negative, and the point where C is 0 but h is 1 is a false positive. Other points—namely, true
positives and true negatives—are correctly classified.
• Given X, we can find S, or G, or any h from the version space and use it as our hypothesis, h. It
seems intuitive to choose h halfway between S and G; this is to increase the margin, which is the
distance between the boundary and the instances closest to it.

• Here, S is the most specific and G is the most general hypothesis.


• In some applications, a wrong decision may be very costly and in such a case, we can say that any
instance that falls in between S and G is a case of doubt, which we cannot label with certainty due
to lack of data.

• We choose the hypothesis with the largest margin, for best separation. The shaded instances are
those that define (or support) the margin; other instances can be removed without affecting h.
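The following sketch ties these ideas together on made-up data: a rectangle hypothesis h in the (price, engine power) plane, the empirical error E(h | X), and the most specific hypothesis S, the tightest rectangle around the positive examples. All numbers are illustrative.

```python
# Family-car example: rectangle hypothesis, empirical error, and the tightest rectangle S.
import numpy as np

# Toy training set: rows are (price, engine power); r = 1 for a family car, 0 otherwise.
X = np.array([[15, 100], [18, 120], [20, 110], [30, 200], [8, 60]], dtype=float)
r = np.array([1, 1, 1, 0, 0])

def h(x, p1, p2, e1, e2):
    """Hypothesis: 1 if (p1 <= price <= p2) AND (e1 <= engine power <= e2), else 0."""
    return int(p1 <= x[0] <= p2 and e1 <= x[1] <= e2)

def empirical_error(params, X, r):
    """Proportion of training instances where h(x^t) != r^t."""
    return np.mean([h(x, *params) != rt for x, rt in zip(X, r)])

# S, the most specific hypothesis: the tightest rectangle that includes all positive examples.
pos = X[r == 1]
S = (pos[:, 0].min(), pos[:, 0].max(), pos[:, 1].min(), pos[:, 1].max())
print("S =", S, " empirical error:", empirical_error(S, X, r))   # error 0 on this toy set
```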
1.3.2. Vapnik-Chervonenkis (VC) Dimension

• The Vapnik–Chervonenkis (VC) dimension is a measure of the capacity (complexity, expressive
power, richness, or flexibility) of a set of functions that can be learned by a statistical binary
classification algorithm.
• It is defined as the cardinality of the largest set of points that the algorithm can shatter, which
means the algorithm can always learn a perfect classifier for any labelling of at least one
configuration of those data points.
• Shattering is the ability of a model to classify a set of points perfectly. More generally, the model
can create a function that can divide the points into two distinct classes without overlapping.
• Let us consider a simple binary classification model on the real line: for a point x, label it as 1 if
a < x < b, and as 0 otherwise, for some parameters a and b (an interval classifier).

• The maximum number of points that can be shattered by H is called the Vapnik-Chervonenkis
(VC) dimension of H, is denoted as VC(H), and measures the capacity of H. Let us say we have a
dataset containing N points.
• These N points can be labelled in 2^N ways as positive and negative. Therefore, 2^N different
learning problems can be defined by N data points.
• If for any of these problems, we can find a hypothesis h ∈ H that separates the positive examples
from the negative, then we say H shatters N points.
• We take two points, m and n. For these two points, there can be 2^2 = 4 distinct labellings in binary
classification.
• We list these cases as follows:

• We can observe that for all four possible labellings of m and n, the model can divide the points into
the two classes.
• This is where we can claim that our model successfully shatters two points in the dataset.
Consequently, the VC dimension of this interval model is at least 2.
• With three points, however, the labelling 1, 0, 1 (reading the points from left to right) cannot be
produced by a single interval, so this simple model cannot shatter three points and its VC dimension
is exactly 2. To shatter more points, we need a richer hypothesis class.

• Consider, for example, axis-aligned rectangles in two dimensions.

• There exist configurations of four points that an axis-aligned rectangle can shatter.
Then VC(H), when H is the hypothesis class of axis-aligned rectangles in two dimensions, is four.
• In calculating the VC dimension, it is enough that we find four points that can be shattered; it is
not necessary that we be able to shatter any four points in two dimensions
• For example, four points placed on a line cannot be shattered by rectangles.
• However, we cannot place five points in two dimensions anywhere such that a rectangle can
separate the positive and negative examples for all possible labellings.
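The shattering argument can also be checked mechanically: the sketch below enumerates all 2^N labellings of a point set and tests whether some axis-aligned rectangle realizes each one. The two point configurations (a diamond and four collinear points) are illustrative.

```python
# Check whether axis-aligned rectangles shatter a point set by enumerating all labellings.
from itertools import product

def rectangle_realizes(points, labels):
    """Is there an axis-aligned rectangle containing exactly the positively labelled points?"""
    pos = [p for p, l in zip(points, labels) if l == 1]
    if not pos:                       # an empty rectangle handles the all-negative labelling
        return True
    x_lo, x_hi = min(p[0] for p in pos), max(p[0] for p in pos)
    y_lo, y_hi = min(p[1] for p in pos), max(p[1] for p in pos)
    # The tightest rectangle around the positives must exclude every negative point.
    return all(not (x_lo <= p[0] <= x_hi and y_lo <= p[1] <= y_hi)
               for p, l in zip(points, labels) if l == 0)

def shatters(points):
    return all(rectangle_realizes(points, labels)
               for labels in product([0, 1], repeat=len(points)))

diamond = [(0, 1), (1, 0), (0, -1), (-1, 0)]
print(shatters(diamond))                            # True: VC dimension is at least 4
print(shatters([(0, 0), (1, 0), (2, 0), (3, 0)]))   # False: four collinear points are not shattered
```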

1.3.3. Probably Approximately Correct (PAC) Learning

• PAC stands for Probably Approximately Correct.


• It is a machine learning framework in which the learner must, with high probability, output a
solution that is approximately correct.
• Probably approximately correct (PAC) learning theory helps analyze whether and under what
conditions a learner L will probably output an approximately correct classifier.
• First, let's define "approximate." A hypothesis h ∈ H is approximately correct if its error over the
distribution of inputs is bounded by some ϵ, where 0 ≤ ϵ ≤ 1/2. That is, error_D(h) < ϵ, where D is the
distribution over inputs.


• Next, "probably." If L outputs such a classifier with probability 1 − δ, where 0 ≤ δ ≤ 1/2, we call
that classifier probably approximately correct.


• Knowing that a target concept is PAC-learnable allows you to bound the sample size necessary to
probably learn an approximately correct classifier, as shown in the following formula:

m ≥ (1/ϵ) (ln |H| + ln (1/δ))

o where m is the number of training examples required,
o |H| is the size of the hypothesis space H (its representational complexity),
o ϵ is the allowed error of the learned hypothesis, and
o δ is the allowed probability of failure.
• To gain some intuition about this, note the effects on m when you alter the variables on the
right-hand side.
• As the allowable error ϵ decreases, the necessary sample size grows. Likewise, it grows as the
required probability of success 1 − δ increases, and with the size of the hypothesis space |H|.
• As we consider more possible classifiers, or desire a lower error or a higher probability of
correctness, we need more data to distinguish between them.
• Here, in our example of the family car, h = S is the tightest possible rectangle; the error region between C
and h = S is the sum of four rectangular strips.

• In this example, we should make sure that the probability of a positive example falling
in this error region (and causing an error) is at most ϵ.
• For any of these strips, if we can guarantee that the probability is upper bounded by ϵ/4, the
total error is at most 4(ϵ/4) = ϵ.
• Because we count the overlaps in the corners twice, the total actual error in this case is actually
less than 4(ϵ/4).
• The probability that a randomly drawn example misses one strip is 1 − ϵ/4.
• The probability that all N independent draws miss the strip is (1 − ϵ/4)^N, and the probability that
all N independent draws miss any of the four strips is at most 4(1 − ϵ/4)^N, which we would like
to be at most δ.
• So, if we choose N and δ such that 4(1 − ϵ/4)^N ≤ δ, then with probability at least 1 − δ the
tightest rectangle has error at most ϵ.
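A small worked example of both bounds, with illustrative values chosen for ϵ, δ, and |H|:

```python
# Worked example of the PAC sample-size bounds above (epsilon, delta, |H| are illustrative).
import math

eps, delta, H_size = 0.05, 0.05, 1000

# Generic bound for a finite hypothesis class: m >= (1/eps) * (ln|H| + ln(1/delta))
m = (1 / eps) * (math.log(H_size) + math.log(1 / delta))
print("samples needed (finite hypothesis class):", math.ceil(m))

# Tightest-rectangle bound: choose N with 4 * (1 - eps/4)**N <= delta,
# i.e. N >= ln(4/delta) / (-ln(1 - eps/4)); using 1 - x <= e^(-x) gives the simpler
# sufficient condition N >= (4/eps) * ln(4/delta).
N = math.log(4 / delta) / -math.log(1 - eps / 4)
print("samples needed (tightest rectangle):", math.ceil(N))
print("simpler sufficient bound (4/eps)*ln(4/delta):", math.ceil((4 / eps) * math.log(4 / delta)))
```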

1.3.4. Noise

• Noise is any unwanted anomaly in the data and due to noise, the class may be more difficult to
learn and zero error may be infeasible with a simple model (hypothesis class).
• Humans are prone to making mistakes when collecting data, and data collection instruments may
be unreliable, resulting in dataset errors. The errors are referred to as noise.
• Data noise in machine learning can cause problems since the algorithm interprets the noise as a
pattern and can start generalizing from it.

• Some of the possible sources of noise are listed below:


• There may be imprecision in recording the input attributes, which may shift the data points in the
input space.
• There may be errors in labelling the data points, which may relabel positive instances as negative
and vice versa. This is sometimes called teacher noise.
• There may be additional attributes, which we have not taken into account, that affect the label of
an instance. Such attributes may be hidden or latent in that they may be unobservable.
• The effect of these neglected attributes is thus modelled as a random component and is included in
“noise”.

1.4. Learning Multiple Classes

• In our example of learning a family car, we have positive examples belonging to the class family
car and the negative examples belonging to all other cars. This is a two-class problem.
• In the above diagram, there are three classes: family car, sports car, and luxury sedan. There are
three hypotheses induced, each one covering the instances of one class and leaving outside the
instances of the other two classes. ‘?’ are reject regions where no, or more than one, class is chosen.
• In this case, we need to learn multiple classes.
• In the general case, we have K classes, denoted as Ci, i = 1, . . . , K, and an input instance belongs
to one and exactly one of them.
• The training set is now of the form X = {x^t, r^t}, t = 1, . . . , N, where the label r^t has K dimensions
and r^t_i = 1 if x^t ∈ Ci and 0 otherwise.
• For each class Ci, we learn a hypothesis hi(x) that outputs 1 if x belongs to Ci and 0 otherwise (a
sketch of this one-hypothesis-per-class scheme is given at the end of this section).

• For a given x, ideally only one of hi(x), i = 1, . . ., K is 1 and we can choose a class. But when no,
or two or more, hi(x) is 1, we cannot choose a class, and this is the case of doubt and the classifier
rejects such cases.
• In our example of learning a family car, we used only one hypothesis and only modelled the
positive examples. Any negative example outside is not a family car.
• Alternatively, sometimes we may prefer to build two hypotheses, one for the positive and the other
for the negative instances.
• This assumes a structure also for the negative instances that can be covered by another hypothesis.
• Separating family cars from sports cars is such a problem; each class has a structure of its own.
• The advantage is that if the input is a luxury sedan, we can have both hypotheses decide negative
and reject the input.
• If, in a dataset, we expect all classes to have similar distributions (similar shapes in the input space),
then the same hypothesis class can be used for all classes.
• For example, in a handwritten digit recognition dataset, we would expect all digits to have similar
distributions.
• But in a medical diagnosis dataset, for example, where we have two classes for sick and healthy
people, we may have completely different distributions for the two classes; there may be multiple
ways for a person to be sick, reflected differently in the inputs: All healthy people are alike; each
sick person is sick in his or her own way.
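A minimal sketch of the one-hypothesis-per-class scheme with a reject option, as referred to above; the three rectangles are hypothetical and not learned from data.

```python
# K two-class hypotheses with a reject option, in the spirit of the car example above.
# Each class gets its own (hypothetical) axis-aligned rectangle in the (price, power) plane.
hypotheses = {
    "family car":   {"price": (10, 25), "power": (90, 140)},
    "sports car":   {"price": (25, 60), "power": (180, 350)},
    "luxury sedan": {"price": (40, 90), "power": (140, 220)},
}

def h_i(x, rect):
    """h_i(x) = 1 if x falls inside the rectangle for class C_i, else 0."""
    (p_lo, p_hi), (e_lo, e_hi) = rect["price"], rect["power"]
    return int(p_lo <= x[0] <= p_hi and e_lo <= x[1] <= e_hi)

def classify(x):
    chosen = [c for c, rect in hypotheses.items() if h_i(x, rect)]
    # Choose a class only when exactly one hypothesis fires; otherwise reject (case of doubt).
    return chosen[0] if len(chosen) == 1 else "reject"

print(classify((18, 110)))   # family car
print(classify((50, 200)))   # falls inside two rectangles -> reject
print(classify((5, 50)))     # falls inside none -> reject
```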
1.5. Regression

• In classification, given an input, the output that is generated is Boolean; it is a yes/no answer.
• When the output is a numeric value, what we would like to learn is not a class, C(x) ∈ {0, 1}, but
a numeric function.
• In machine learning, the function is not known but we have a training set of examples drawn from
it.
• X = {x^t, r^t}, t = 1, . . . , N

• where r^t ∈ R. If there is no noise, the task is interpolation. We would like to find the function f(x)
that passes through these points such that we have

r^t = f(x^t)
• In polynomial interpolation, given N points, we find the (N−1)st degree polynomial that we can
use to predict the output for any x. If x is outside the range of the x^t in the training set, this is
called extrapolation.
• In regression, there is noise added to the output of the unknown function:

r^t = f(x^t) + ϵ

• where f(x) ∈ R is the unknown function and ϵ is random noise.
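A short sketch contrasting interpolation of noise-free outputs with fitting a simpler model to noisy outputs; the underlying function f and the data points are made up.

```python
# Interpolation vs. regression on noisy outputs r^t = f(x^t) + eps (all data is illustrative).
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 6)
f = lambda x: 2 * x + 1                                # assumed underlying function
r_clean = f(x)                                         # no noise: interpolation
r_noisy = f(x) + rng.normal(0, 0.1, size=x.shape)      # noise added to the output

interp  = np.polyfit(x, r_clean, deg=len(x) - 1)       # (N-1)st degree polynomial through all points
regress = np.polyfit(x, r_noisy, deg=1)                # low-degree fit that averages out the noise

print("interpolating polynomial coefficients:", np.round(interp, 3))
print("fitted line (w, w0):", np.round(regress, 3))
```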

1.5.1. Model Selection and Generalization

• Let us start with the case of learning a Boolean function from examples.
• In a Boolean function, all inputs and the output are binary.
• There are 2^d possible ways to write d binary values and therefore, with d inputs, the training set
has at most 2^d examples.

• With d = 2 inputs, there are four possible input cases and sixteen possible Boolean functions.
• Each of these input cases can be labelled as 0 or 1 independently, and therefore there are 2^(2^d)
possible Boolean functions of d inputs.
• Each distinct training example removes half the hypotheses, namely, those whose guesses are
wrong.
• For example, let us say we have x1 = 0, x2 = 1 and the output is 0; this removes h5, h6, h7, h8, h13,
h14, h15, h16.
• This is one way to interpret learning: we start with all possible hypotheses and, as we see more
training examples, we remove those hypotheses that are not consistent with the training data (a small
sketch at the end of this section illustrates this elimination).
• In the case of a Boolean function, to end up with a single hypothesis we need to see all 2^d training
examples.
• If the training set we are given contains only a small subset of all possible instances, as it generally
does (that is, if we know what the output should be for only a small percentage of the cases), the
solution is not unique.
• After seeing N example cases, there remain 2^(2^d − N) possible functions. This is an example of an
ill-posed problem where the data by itself is not sufficient to find a unique solution.
• So, because learning is ill-posed, and data by itself is not sufficient to find the solution, we should
make some extra assumptions to have a unique solution with the data we have. The set of
assumptions we make to have learning possible is called the inductive bias of the learning
algorithm.
• Thus, learning is not possible without inductive bias, and now the question is how to choose the
right bias. This is called model selection, which is choosing between possible H.
• In answering this question, we should remember that the aim of machine learning is rarely to
replicate the training data but the prediction for new cases.
• That is we would like to be able to generate the right output for an input instance outside the training
set, one for which the correct output is not given in the training set.
• How well a model trained on the training set predicts the right output for new instances is called
generalization.
• For best generalization, we should match the complexity of the hypothesis class H with the
complexity of the function underlying the data. If H is less complex than the function, we have
underfitting, for example, when trying to fit a line to data sampled from a third-order polynomial.
• If there is noise, an overcomplex hypothesis may learn not only the underlying function but also
the noise in the data and may make a bad fit. This is called overfitting.
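Finally, the elimination view of learning Boolean functions described earlier can be sketched as follows, using the same single training example (x1 = 0, x2 = 1, output 0).

```python
# Hypothesis elimination for Boolean functions of d = 2 inputs: enumerate all 2^(2^d) = 16
# functions and discard those inconsistent with each training example.
from itertools import product

d = 2
inputs = list(product([0, 1], repeat=d))                  # the 2^d = 4 possible input rows
hypotheses = [dict(zip(inputs, outs))                     # each hypothesis fixes an output per row
              for outs in product([0, 1], repeat=len(inputs))]
print("hypotheses before training:", len(hypotheses))     # 16

training_examples = [((0, 1), 0)]                         # x1 = 0, x2 = 1, output 0
for x, y in training_examples:
    hypotheses = [h for h in hypotheses if h[x] == y]     # each example removes the inconsistent half
print("hypotheses after one example:", len(hypotheses))   # 8
```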
