Unit III
Introduction to machine learning - Linear Regression Models: Least squares, single & multiple
variables, Bayesian linear regression, gradient descent, Linear Classification Models: Discriminant
function - Probabilistic discriminative model - Logistic regression, Probabilistic generative model -
Naive Bayes, Maximum margin classifier - Support vector machine, Decision Tree, Random forests.
• Machine Learning (ML) is a sub-field of Artificial Intelligence (AI) that is concerned with developing computational theories of learning and building learning machines.
• Learning is a phenomenon and process which has manifestations of various aspects. Learning
process includes gaining of new symbolic knowledge and development of cognitive skills through
instruction and practice. It is also discovery of new facts and theories through observation and
experiment.
• Machine Learning Definition: A computer program is said to learn from experience E with respect
to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by
P, improves with experience E.
• It is very hard to write programs that solve problems like recognizing a human face. We do not
know what program to write because we don't know how our brain does it. Instead of writing a
program by hand, it is possible to collect lots of examples that specify the correct output for a given
input.
• A machine learning algorithm then takes these examples and produces a program that does the
job. The program produced by the learning algorithm may look very different from a typical hand-
written program. It may contain millions of numbers. If we do it right, the program works for new
cases as well as the ones we trained it on.
• Main goal of machine learning is to devise learning algorithms that do the learning automatically
without human intervention or assistance. The machine learning paradigm can be viewed as
"programming by example." Another goal is to develop computational models of human learning
process and perform computer simulations.
• The goal of machine learning is to build computer systems that can adapt and learn from their
experience.
1. Training: A training set of examples of correct behavior is analyzed and some representation of the
newly learnt knowledge is stored. This is some form of rules.
2. Validation: The rules are checked and, if necessary, additional training is given. Sometimes additional test data are used; alternatively, a human expert or some other automatic knowledge-based component may validate the rules. The role of the tester is often called the opponent.
• Machine learning algorithms can figure out how to perform important tasks by generalizing from
examples.
• Machine learning provides business insight and intelligence. Decision makers are provided with
greater insights into their organizations. This adaptive technology is being used by global enterprises
to gain a competitive edge.
• Machine learning algorithms discover the relationships between the variables of a system (input,
output and hidden) from direct samples of the system.
1) Some tasks cannot be defined well, except by examples. For example: Recognizing people.
2) Relationships and correlations can be hidden within large amounts of data. Machine learning and data mining techniques are used to discover these hidden relationships.
3) Human designers often produce machines that do not work as well as desired in the environments
in which they are used.
4) The amount of knowledge available about certain tasks might be too large for explicit encoding by
humans.
• Machine learning also helps us find solutions of many problems in computer vision, speech
recognition and robotics. Machine learning uses the theory of statistics in building mathematical
models, because the core task is making inference from a sample.
1. Tasks: The problems that can be solved with machine learning. A task is an abstract representation
of a problem. The standard methodology in machine learning is to learn one task at a time. Large
problems are broken into small, reasonably independent sub-problems that are learned separately
and then recombined.
• Predictive tasks perform inference on the current data in order to make predictions. Descriptive
tasks characterize the general properties of the data in the database.
2. Models: The output of machine learning. Different models are geometric models, probabilistic
models, logical models, grouping and grading.
• The model-based approach seeks to create a solution tailored to each new application.
Instead of having to transform your problem to fit some standard algorithm, in model-based machine
learning you design the algorithm precisely to fit your problem.
• A model is just made up of a set of assumptions, expressed in a precise mathematical form. These
assumptions include the number and types of variables in the problem domain, which variables
affect each other, and what the effect of changing one variable is on another variable.
• Machine learning models are classified as: Geometric model, Probabilistic model and Logical
model.
• Feature selection is a process that chooses a subset of features from the original features so that
the feature space is optimally reduced according to a certain criterion.
Types of Learning
• Learning is essential for unknown environments, i.e. when the designer lacks omniscience. Learning simply means incorporating information from the training examples into the system.
• Learning is any change in a system that allows it to perform better the second time on repetition of
the same task or on another task drawn from the same population. One part of learning is acquiring
knowledge and new information; and the other part is problem-solving.
• Supervised and Unsupervised Learning are the different types of machine learning methods. A
computational learning model should be clear about the following aspects:
1. Learner: Who or what is doing the learning. For example: Program or algorithm.
6. Information source: The information (training data) the program uses for learning.
• Machine learning is a scientific discipline concerned with the design and development of algorithms that allow computers to evolve behaviours based on empirical data, such as data from sensors or databases.
• Machine learning is usually divided into two main types: Supervised Learning and Unsupervised
Learning.
2. Discover new things or structure that is unknown to humans (Example: Data mining).
Supervised Learning
• Supervised learning is the machine learning task of inferring a function from supervised training
data. The training data consist of a set of training examples. The task of the supervised learner is to
predict the output behavior of a system for any set of input values, after an initial training phase.
• In supervised learning, the network is trained by providing it with input and matching output patterns. These input-output pairs are usually provided by an external teacher.
• Human learning is based on the past experiences. A computer does not have experiences.
• A computer system learns from data, which represent some "past experiences" of an application
domain.
• The goal is to learn a target function that can be used to predict the values of a discrete class attribute, e.g., approved or not approved, high risk or low risk. The task is commonly called supervised learning, classification or inductive learning.
• Training data includes both the input and the desired results. For some examples the correct
results (targets) are known and are given in input to the model during the learning process. The
construction of a proper training, validation and test set is crucial. These methods are usually fast
and accurate.
• The model has to be able to generalize: to give correct results when new data are given as input without knowing the target a priori.
• Supervised learning is the machine learning task of inferring a function from supervised training
data. The training data consist of a set of training examples. In supervised learning, each example is a
pair consisting of an input object and a desired output value.
• A supervised learning algorithm analyzes the training data and produces an inferred function,
which is called a classifier or a regression function. Fig. 8.2.1 shows supervised learning process.
• The learned model helps the system to perform task better as compared to no learning.
• Supervised learning is further divided into methods which use reinforcement or error correction.
The perceptron learning algorithm is an example of supervised learning with reinforcement.
In order to solve a given problem of supervised learning, the following steps are performed:
4. Determine the structure of the learned function and corresponding learning algorithm.
5. Complete the design and then run the learning algorithm on the collected training set.
6. Evaluate the accuracy of the learned function. After parameter adjustment and learning, the
performance of the resulting function should be measured on a test set that is separate from the
training set.
Unsupervised Learning
• In unsupervised learning, the model is not provided with the correct results during training. It can be used to cluster the input data into classes on the basis of their statistical properties only. Cluster significance and labeling are determined afterwards.
• The labeling can be carried out even if the labels are only available for a small number of objects
representative of the desired classes. All similar inputs patterns are grouped together as clusters.
• If matching pattern is not found, a new cluster is formed. There is no error feedback.
• An external teacher is not used, and learning is based only upon local information. It is also referred to as self-organization.
• They are called unsupervised because they do not need a teacher or super-visor to label a set of
training examples. Only the original data is required to start the analysis.
• In contrast to supervised learning, unsupervised or self-organized learning does not require an
external teacher. During the training session, the neural network receives a number of different
input patterns, discovers significant features in these patterns and learns how to classify input data
into appropriate categories.
• Unsupervised learning algorithms aim to learn rapidly and can be used in real-time. Unsupervised
learning is frequently employed for data clustering, feature extraction etc.
• Another mode of learning, called recording learning by Zurada, is typically employed for associative memory networks. An associative memory network is designed by recording several ideal patterns into the network's stable states.
Semi-supervised Learning
• Semi-supervised learning uses both labeled and unlabeled data to improve supervised learning.
The goal is to learn a predictor that predicts future test data better than the predictor learned from
the labeled training data alone.
• Semi-supervised learning is motivated by its practical value in learning faster, better and cheaper.
In many real world applications, it is relatively easy to acquire a large amount of unlabeled data x.
• For example, documents can be crawled from the Web, images can be obtained from surveillance
cameras, and speech can be collected from broadcast. However, their corresponding labels y for the
prediction task, such as sentiment orientation, intrusion detection and phonetic transcript, often
requires slow human annotation and expensive laboratory experiments.
• In many practical learning domains, there is a large supply of unlabeled data but limited labeled
data, which can be expensive to generate. For example: text processing, video-indexing,
bioinformatics etc.
• Semi-supervised Learning makes use of both labeled and unlabeled data for training, typically a
small amount of labeled data with a large amount of unlabeled data. When unlabeled data is used in
conjunction with a small amount of labeled data, it can produce considerable improvement in
learning accuracy.
• Semi-supervised clustering: Uses small amount of labeled data to aid and bias the clustering of
unlabeled data.
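The sketch below illustrates this semi-supervised idea with scikit-learn's LabelSpreading, assuming scikit-learn is available; the choice of dataset and the fraction of hidden labels are purely illustrative assumptions.

```python
# A hedged sketch of semi-supervised learning: most labels are hidden (-1),
# and label spreading propagates the few known labels to the unlabeled points.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.semi_supervised import LabelSpreading

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

y_partial = y.copy()
mask = rng.random(len(y)) < 0.8      # hide about 80% of the labels
y_partial[mask] = -1                 # -1 marks an unlabeled sample

model = LabelSpreading()
model.fit(X, y_partial)              # learns from labeled + unlabeled data together
print("accuracy against the full label set:", round(model.score(X, y), 3))
```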
Reinforcement Learning
• In supervised learning the learner gets immediate feedback, and in unsupervised learning it gets no feedback. In reinforcement learning, the learner receives delayed scalar feedback.
• Reinforcement learning is learning what to do and how to map situations to actions. The learner is
not told which actions to take. Fig. 8.2.3 shows concept of reinforced learning.
• Reinforcement learning deals with agents that must sense and act upon their environment. It combines classical Artificial Intelligence and machine learning techniques.
• It allows machines and software agents to automatically determine the ideal behavior within a
specific context, in order to maximize its performance. Simple reward feedback is required for the
agent to learn its behavior; this is known as the reinforcement signal.
• The two most important distinguishing features of reinforcement learning are trial-and-error search and delayed reward.
• With reinforcement learning algorithms an agent can improve its performance by using the feedback it gets from the environment. This environmental feedback is called the reward signal.
• Based on accumulated experience, the agent needs to learn which action to take in a given situation in order to obtain a desired long-term goal. Essentially, actions that lead to long-term rewards need to be reinforced. Reinforcement learning has connections with control theory, Markov decision processes and game theory.
• Example of reinforcement learning: A mobile robot decides whether it should enter a new room in search of more trash to collect or start trying to find its way back to its battery recharging station. It makes its decision based on how quickly and easily it has been able to find the recharger in the past.
1. Policy
2. Reward function
3. Value function
• Policy: The policy defines the learning agent's behaviour at a given time. It is a mapping from perceived states of the environment to actions to be taken when in those states.
• Reward Function: Reward function is used to define a goal in a reinforcement learning problem. It
also maps each perceived state of the environment to a single number.
• Value function: Value functions specify what is good in the long run. The value of a state is the total
amount of reward an agent can expect to accumulate over the future, starting from that state.
• Credit assignment problem: Reinforcement learning algorithms learn to generate an internal value
for the intermediate states as to how good they are in leading to the goal.
• The learning decision maker is called the agent. The agent interacts with the environment that
includes everything outside the agent.
• The agent has sensors to decide on its state in the environment and takes an action that modifies
its state.
• Reinforcement learning uses a formal framework defining the interaction between a learning agent
and its environment in terms of states, actions, and rewards. This framework is intended to be a
simple way of representing essential features of the artificial intelligence problem.
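As a rough illustration of this agent-environment loop of states, actions and rewards, here is a hedged sketch of tabular Q-learning on a made-up chain environment; the environment, parameters and reward scheme are assumptions for illustration only, not part of the text.

```python
# A minimal tabular Q-learning sketch on a toy 5-state chain: the agent starts
# at state 0 and receives a delayed reward of 1 only on reaching the goal state.
import random

N_STATES, ACTIONS = 5, [0, 1]           # states 0..4; action 0 = left, 1 = right
GOAL = N_STATES - 1
alpha, gamma, epsilon = 0.1, 0.9, 0.1   # learning rate, discount, exploration rate

Q = [[0.0, 0.0] for _ in range(N_STATES)]   # value of each (state, action) pair

for episode in range(2000):
    s = 0
    while s != GOAL:
        # epsilon-greedy policy: mostly exploit, occasionally explore
        a = random.choice(ACTIONS) if random.random() < epsilon else max(ACTIONS, key=lambda x: Q[s][x])
        s_next = max(0, s - 1) if a == 0 else min(GOAL, s + 1)
        r = 1.0 if s_next == GOAL else 0.0           # delayed reward at the goal
        # Q-learning update: move the estimate toward reward + discounted future value
        Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
        s = s_next

print("learned action values:", [[round(q, 2) for q in row] for row in Q])
```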
Regression
• Regression finds correlations between dependent and independent variables. If the desired output consists of one or more continuous variables, then the task is called regression.
• Therefore, regression algorithms help predict continuous variables such as house prices, market
trends, weather patterns, oil and gas prices etc.
• Regression analysis is a set of statistical methods used for the estimation of relationships between
a dependent variable and one or more independent variables. It can be utilized to assess the
strength of the relationship between variables and for modelling the future relationship between
them.
• The two basic types of regression are linear regression and multiple linear regression.
• Linear regression is a statistical method that allows us to summarize and study relationships
between two continuous (quantitative) variables.
• The objective of a linear regression model is to find a relationship between the input variables and
a target variable.
1. One variable, denoted x, is regarded as the predictor, explanatory or independent variable.
2. The other variable, denoted y, is regarded as the response, outcome or dependent variable.
• Regression models predict a continuous variable, such as the sales made on a day or the temperature of a city. Imagine that we fit a line to the training points that we have. If we want to add another data point, then to fit it we may need to change the existing model.
• This will happen with each data point that we add to the model; hence, linear regression is not well suited to classification problems.
• Regression estimates are used to explain the relationship between one dependent variable and one or more independent variables. Classification predicts categorical labels (classes), while prediction models continuous-valued functions. Classification is considered to be supervised learning.
• Classification classifies data based on the training set and the values in a classifying attribute, and uses this to classify new data. Prediction models continuous-valued functions, i.e. it predicts unknown or missing values.
• The regression line gives the average relationship between the two variables in mathematical form.
• For two variables X and Y, there are always two lines of regression.
• Regression line of X on Y: Gives the best estimate for the value of X for any specific given value of Y:
X = a + bY
where a = X-intercept, X = dependent variable and Y = independent variable.
• Regression line of Y on X: Gives the best estimate for the value of Y for any specific given value of X:
Y = a + bX
where a = Y-intercept, Y = dependent variable and X = independent variable.
• By using the least squares method (a procedure that minimizes the vertical deviations of plotted
points surrounding a straight line) we are able to construct a best fitting straight line to the scatter
diagram points and then formulate a regression equation in the form of :
y = a + bx, or equivalently ŷ = ȳ + b(x − x̄)
• Regression analysis is the art and science of fitting straight lines to patterns of data. In a linear
regression model, the variable of interest ("dependent" variable) is predicted from k other variables
("independent" variables) using a linear equation. If Y denotes the dependent variable and X1,..., Xk,
are the independent variables, then the assumption is that the value of Y at time t in the data sample
is determined by the linear equation:
Yt = β0 + β1 X1t + β2 X2t + ... + βk Xkt + εt
where the betas are constants and the epsilons are independent and identically distributed normal random variables with mean zero.
• At each split point, the "error" between the predicted value and the actual values is squared to get
a "Sum of Squared Errors (SSE)". The split point errors across the variables are compared and the
variable/point yielding the lowest SSE is chosen as the root node/split point. This process is
recursively continued.
• Error function measures how much our predictions deviate from the desired answers.
Mean-squared error: Jn = (1/n) Σ i=1..n (yi − f(xi))²
Advantages:
a. Training a linear regression model is usually much faster than methods such as neural networks.
b. Linear regression models are simple and require minimum memory to implement.
c. By examining the magnitude and sign of the regression coefficients you can infer how predictor
variables affect the target outcome.
Least squares
• The method of least squares is about estimating parameters by minimizing the squared discrepancies between observed data, on the one hand, and their expected values on the other.
• Considering an arbitrary straight line, y = b0 +b1 x, is to be fitted through these data points. The
question is "Which line is the most representative"?
• What are the values of b0 and b1 such that the resulting line "best" fits the data points? And what goodness-of-fit criterion should be used to decide among all possible combinations of b0 and b1?
• The Least Squares (LS) criterion states that the sum of the squares of the errors is minimum. The least-squares solution yields y(x) whose elements sum to 1, but does not ensure the outputs to be in the range [0, 1].
• How do we draw such a line based on the observed data points? Suppose an imaginary line y = a + bx.
• Imagine a vertical distance between the line and a data point, E = Y − E(Y).
• This error is the deviation of the data point from the imaginary line, the regression line. Then what are the best values of a and b? The a and b that minimize the sum of such errors.
• Deviation does not have good properties for computation. Then why do we use squares of
deviation? Let us get a and b that can minimize the sum of squared deviations rather than the sum of
deviations. This method is called least squares.
• The least squares method minimizes the sum of squares of errors. Such a and b are called least squares estimators, i.e., estimators of the parameters a and b.
• The process of getting parameter estimators (e.g., a and b) is called estimation. This least squares estimation method is known as Ordinary Least Squares (OLS).
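A minimal sketch of the least squares estimators a and b in Python, using made-up sample points; the data below are illustrative and are not the table from Example 8.3.1.

```python
# Fit y = a + b*x by ordinary least squares with NumPy (illustrative data).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

# Closed-form OLS estimators:
# b = sum((x - x_mean)(y - y_mean)) / sum((x - x_mean)^2),  a = y_mean - b * x_mean
x_mean, y_mean = x.mean(), y.mean()
b = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
a = y_mean - b * x_mean

y_pred = a + b * x
sse = np.sum((y - y_pred) ** 2)   # sum of squared errors
mse = sse / len(y)                # mean-squared error Jn
print(f"a = {a:.3f}, b = {b:.3f}, SSE = {sse:.3f}, MSE = {mse:.3f}")
```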
Example 8.3.1 Fit a straight line to the points in the table. Compute m and b by least squares.
• Regression analysis is used to predict the value of one or more responses from a set of predictors. It can also be used to estimate the linear association between the predictors and responses. Predictors can be continuous or categorical or a mixture of both.
• If multiple independent variables affect the response variable, then the analysis calls for a model
different from that used for the single predictor variable. In a situation where more than one
independent factor (variable) affects the outcome of process, a multiple regression model is used.
This is referred to as multiple linear regression model or multivariate least squares fitting.
• Let z1, z2, ..., zr be a set of r predictors believed to be related to a response variable Y. The linear regression model for the jth sample unit has the form
Yj = β0 + β1 zj1 + β2 zj2 + ... + βr zjr + εj
where εj is a random error and the βi, i = 0, 1, ..., r, are unknown regression coefficients.
• With n independent observations, we can write one model for each sample unit so that the model
is now
Y = Zβ + ε
• In order to estimate β, we take a least squares approach that is analogous to what we did in the simple linear regression case.
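The following sketch illustrates the multivariate least squares fit Y = Zβ + ε with NumPy's lstsq on synthetic data; the true coefficients and noise level are illustrative assumptions.

```python
# Multivariate least squares for Y = Z*beta + eps with NumPy (synthetic data).
import numpy as np

rng = np.random.default_rng(0)
n, r = 100, 3                                  # n observations, r predictors
Z_raw = rng.normal(size=(n, r))
true_beta = np.array([2.0, 0.5, -1.0, 3.0])    # [intercept, b1, b2, b3]
Z = np.column_stack([np.ones(n), Z_raw])       # design matrix with intercept column
Y = Z @ true_beta + rng.normal(scale=0.1, size=n)

# Least squares solution: beta_hat = argmin ||Y - Z beta||^2
beta_hat, residuals, rank, _ = np.linalg.lstsq(Z, Y, rcond=None)
print("estimated coefficients:", np.round(beta_hat, 3))
```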
Bayesian Linear Regression
• Bayesian linear regression provides a useful mechanism to deal with insufficient or poorly distributed data. It allows the user to put a prior on the coefficients and on the noise, so that in the absence of data the priors can take over. A prior is a distribution on a parameter.
• If we could flip the coin an infinite number of times, inferring its bias would be easy by the law of
large numbers. However, what if we could only flip the coin a handful of times? Would we guess that
a coin is biased if we saw three heads in three flips, an event that happens one out of eight times
with unbiased coins? The MLE would overfit these data, inferring a coin bias of p =1.
• A Bayesian approach avoids overfitting by quantifying our prior knowledge that most coins are
unbiased, that the prior on the bias parameter is peaked around one-half. The data must overwhelm
this prior belief about coins.
• Bayesian methods allow us to estimate model parameters, to construct model forecasts and to
conduct model comparisons. Bayesian learning algorithms can calculate explicit probabilities for
hypotheses.
• Bayesian classifiers use a simple idea that the training data are utilized to calculate an observed
probability of each class based on feature values.
• When Bayesian classifier is used for unclassified data, it uses the observed probabilities to predict
the most likely class for the new features.
• Each observed training example can incrementally decrease or increase the estimated probability
that a hypothesis is correct.
• Prior knowledge can be combined with observed data to determine the final probability of a
hypothesis. In Bayesian learning, prior knowledge is provided by asserting a prior probability for each
candidate hypothesis and a probability distribution over observed data for each possible hypothesis.
• Bayesian methods can accommodate hypotheses that make probabilistic predictions. New instances
can be classified by combining the predictions of multiple hypotheses, weighted by their
probabilities.
• Even in cases where Bayesian methods prove computationally intractable, they can provide a standard of optimal decision making against which other practical methods can be measured.
2. Medical diagnosis.
ii) Create a model mapping the training inputs to the training outputs.
iii) Have a Markov Chain Monte Carlo (MCMC) algorithm draw samples from the posterior
distributions for the parameters
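A hedged sketch of Bayesian linear regression with a Gaussian prior on the coefficients and a known noise variance, computing the posterior in closed form rather than with MCMC; the prior precision alpha and the toy data are assumptions, not the text's formulation.

```python
# Conjugate Bayesian linear regression sketch: Gaussian prior on weights,
# known noise variance, closed-form posterior N(m_N, S_N).
import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(20), rng.uniform(-1, 1, 20)])   # design matrix
w_true = np.array([0.5, -1.3])
noise_var = 0.05
y = X @ w_true + rng.normal(scale=np.sqrt(noise_var), size=20)

alpha = 2.0                # prior precision: w ~ N(0, (1/alpha) I)
beta = 1.0 / noise_var     # noise precision

# Posterior over weights
S_N_inv = alpha * np.eye(X.shape[1]) + beta * X.T @ X
S_N = np.linalg.inv(S_N_inv)
m_N = beta * S_N @ X.T @ y
print("posterior mean of weights:", np.round(m_N, 3))
```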
Gradient Descent
• First and second derivatives of the objective function or the constraints play an important role in
optimization. The first order derivatives are called the gradient and the second order derivatives are
called the Hessian matrix.
• Derivative-based optimization is also called nonlinear optimization. It is capable of determining search directions according to an objective function's derivative information.
1. Steepest descent
2. Newton-Raphson method
Gradient Descent:
• Gradient descent is a first-order optimization algorithm. To find a local minimum of a function using
gradient descent, one takes steps proportional to the negative of the gradient of the function at the
current point.
• Gradient descent is popular for very large-scale optimization problems because it is easy to
implement, can handle black box functions, and each iteration is cheap.
• Given a differentiable scalar field f(x) and an initial guess x1, gradient descent iteratively moves the guess toward lower values of f by taking steps in the direction of the negative gradient, −∇f(x).
• Locally, the negated gradient is the steepest descent direction, i.e., the direction that x would need
to move in order to decrease "f" the fastest. The algorithm typically converges to a local minimum,
but may rarely reach a saddle point, or not move at all if x1 lies at a local maximum.
• The gradient will give the slope of the curve at that x and its direction will point to an increase in
the function. So we change x in the opposite direction to lower the function value:
xk+1 = xk − λ∇f(xk)
The λ > 0 is a small number that forces the algorithm to make small jumps
• Gradient descent is relatively slow close to the minimum; technically, its asymptotic rate of convergence is inferior to that of many other methods.
• For poorly conditioned convex problems, gradient descent increasingly 'zigzags' as the gradients point nearly orthogonally to the shortest direction to a minimum point.
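A minimal sketch of the gradient descent update xk+1 = xk − λ∇f(xk) on an illustrative quadratic objective; the objective, step size and iteration count are assumptions for demonstration.

```python
# Gradient descent on a simple quadratic: converges to the minimizer (3, -1).
import numpy as np

def f(x):
    return (x[0] - 3.0) ** 2 + 2.0 * (x[1] + 1.0) ** 2

def grad_f(x):
    return np.array([2.0 * (x[0] - 3.0), 4.0 * (x[1] + 1.0)])

x = np.array([0.0, 0.0])       # initial guess x1
lam = 0.1                      # small positive step size (learning rate)
for k in range(200):
    x = x - lam * grad_f(x)    # step in the direction of the negative gradient

print("approximate minimizer:", np.round(x, 4), "f(x) =", round(f(x), 6))
```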
Steepest Descent:
• This method is based on first order Taylor series approximation of objective function. This method
is also called saddle point method. Fig. 8.3.5 shows steepest descent method.
• Steepest descent is the simplest of the gradient methods. The chosen direction is the one in which f decreases most quickly, which is the direction opposite to ∇f(xi). The search starts at an arbitrary point x0 and then goes down the gradient until it reaches close to the solution.
• The method of steepest descent is the discrete analogue of gradient descent, but the best move is
computed using a local minimization rather than computing a gradient. It is typically able to converge
in few steps but it is unable to escape local minima or plateaus in the objective function.
• The gradient is everywhere perpendicular to the contour lines. After each line minimization the new gradient is always orthogonal to the previous step direction. Consequently, the iterates tend to zig-zag down the valley in a very inefficient manner.
• The method of steepest descent is simple, easy to apply, and each iteration is fast. It is also very stable; if minimum points exist, the method is guaranteed to locate them, although possibly only after a very large number of iterations.
Linear Classification Models
• A linear classification algorithm (classifier) makes its classification based on a linear predictor function combining a set of weights with the feature vector.
• A linear classifier makes its classification decision based on the value of a linear combination of the characteristics. Imagine that the linear classifier merges into its weights all the characteristics that define a particular class.
• Linear classifiers can represent a lot of things, but they can't represent everything. The classic
example of what they can't represent is the XOR function.
Discriminant Function
• Linear Discriminant Analysis (LDA) is the most commonly used dimensionality reduction technique
in supervised learning. Basically, it is a preprocessing step for pattern classification and machine
learning applications. LDA is a powerful algorithm that can be used to determine the best separation
between two or more classes.
• LDA is a supervised learning algorithm, which means that it requires a labelled training set of data
points in order to learn the linear discriminant function.
• The main purpose of LDA is to find the line or plane that best separates data points belonging to
different classes. The key idea behind LDA is that the decision boundary should be chosen such that
it maximizes the distance between the means of the two classes while simultaneously minimizing the
variance within each class's data (the within-class scatter). This criterion is known as the Fisher criterion.
• LDA is one of the most widely used machine learning algorithms due to its accuracy and flexibility.
LDA can be used for a variety of tasks such as classification, dimensionality reduction, and feature
selection.
• Suppose we have two classes and we need to classify them efficiently, then using LDA, classes are
divided as follows:
• LDA algorithm works based on the following steps:
a) The first step is to calculate the means and standard deviation of each feature.
b) Within class scatter matrix and between class scatter matrix is calculated
c) These matrices are then used to calculate the eigenvectors and eigenvalues.
d) LDA chooses the k eigenvectors with the largest eigenvalues to form a transformation matrix.
e) LDA uses this transformation matrix to transform the data into a new space with k dimensions.
f) Once the transformation matrix transforms the data into the new space with k dimensions, LDA can then be used for classification or dimensionality reduction.
c) LDA is not susceptible to the "curse of dimensionality" like many other machine learning
algorithms.
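A short sketch of LDA used for both classification and dimensionality reduction, assuming scikit-learn is available; the synthetic dataset and parameter choices are illustrative.

```python
# LDA sketch: fit a linear discriminant, then project onto the discriminant axis.
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = make_classification(n_samples=200, n_features=5, n_informative=3,
                           n_classes=2, random_state=0)

lda = LinearDiscriminantAnalysis(n_components=1)   # at most (n_classes - 1) components
X_proj = lda.fit_transform(X, y)                   # project onto the discriminant direction

print("training accuracy:", round(lda.score(X, y), 3))
print("projected shape:", X_proj.shape)            # (200, 1)
```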
Logistic Regression
• Logistic regression is a form of regression analysis in which the outcome variable is binary or
dichotomous. A statistical method used to model dichotomous or binary outcomes using predictor
variables.
• Logistic component: Instead of modeling the outcome Y directly, the method models the log odds of Y using the logistic function.
• Regression component: Methods used to quantify association between an outcome and predictor
variables. It could be used to build predictive models as a function of predictors.
Logistic Regression:
ln[P(Y)/(1 − P(Y))] = β0 + β1 X1 + β2 X2 + ... + βk Xk
Y = β0 + β1 X1 + β2 X2 +...+ βk Xk + ε
• With logistic regression, the response variable is an indicator of some characteristic, that is, a 0/1
variable. Logistic regression is used to determine whether other measurements are related to the
presence of some characteristic, for example, whether certain blood measures are predictive of
having a disease.
• If analysis of covariance can be said to be a t test adjusted for other variables, then logistic
regression can be thought of as a chi-square test for homogeneity of proportions adjusted for other
variables. While the response variable in a logistic regression is a 0/1 variable, the logistic regression
equation, which is a linear equation, does not predict the 0/1 variable itself.
Linear regression:
p = a0 + a1 X1 + a2 X2 + ... + ak Xk
Logistic regression:
ln[p/(1 − p)] = a0 + a1 X1 + a2 X2 + ... + ak Xk
• The linear model assumes that the probability p is a linear function of the regressors, while the
logistic model assumes that the natural log of the odds p/(1 - p) is a linear function of the regressors.
• The major advantage of the linear model is its interpretability. In the linear model, if a1 is 0.05, that means that a one-unit increase in X1 is associated with a 5-percentage-point increase in the probability that Y is 1.
• The logistic model is less interpretable. In the logistic model, if b1 is 0.05, that means that a one-
unit increase in X1 is associated with a 0.05 increase in the log odds that Y is 1. And what does that
mean? I've never met anyone with any intuition for log odds.
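A hedged sketch of fitting a logistic regression model with scikit-learn; the breast-cancer dataset and solver settings are illustrative assumptions, and the coefficients are read as changes in log odds, as discussed above.

```python
# Logistic regression sketch on a binary-outcome dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=5000)   # models the log odds as a linear function
model.fit(X_train, y_train)

print("test accuracy:", round(model.score(X_test, y_test), 3))
# Each coefficient is the change in log odds per unit change in that predictor.
print("first few coefficients:", model.coef_[0][:3])
```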
Probabilistic Generative Models
• Generative models are a class of statistical models that generate new data instances. These models are used in unsupervised machine learning to perform tasks such as probability and likelihood estimation, modelling data points, and distinguishing between classes using these probabilities.
• Generative models rely on Bayes' theorem to find the joint probability. Generative models describe how data is generated using probabilistic models. They predict P(y | x), the probability of y given x, by calculating P(x, y), the joint probability of x and y.
Naive Bayes
• Naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes'
theorem with strong independence assumptions between the features. It is highly scalable, requiring
a number of parameters linear in the number of variables (features/predictors) in a learning
problem.
• A Naive Bayes Classifier is a program which predicts a class value given a set of attributes.
2. Use the product rule to obtain a joint conditional probability for the attributes.
3. Use Bayes rule to derive conditional probabilities for the class variable.
• Once this has been done for all class values, output the class with the highest probability.
• Naive Bayes simplifies the calculation of probabilities by assuming that the probability of each attribute belonging to a given class value is independent of all other attributes. This is a strong assumption but results in a fast and effective method.
• The probability of a class value given a value of an attribute is called the conditional probability. By
multiplying the conditional probabilities together for each attribute for a given class value, we have a
probability of a data instance belonging to that class.
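A small sketch of a Naive Bayes classifier, assuming scikit-learn is available; GaussianNB is used here because the iris features are continuous, whereas MultinomialNB would be the more usual choice for counts or text.

```python
# Naive Bayes sketch: per-class feature distributions are estimated independently,
# then multiplied (via Bayes' rule) to score each class for a new instance.
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nb = GaussianNB()
nb.fit(X_train, y_train)

print("test accuracy:", round(nb.score(X_test, y_test), 3))
print("class probabilities for one sample:", nb.predict_proba(X_test[:1]).round(3))
```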
Conditional Probability
• Let A and B be two events such that P(A) > 0. We denote by P(B | A) the probability of B given that A has occurred. Since A is known to have occurred, it becomes the new sample space replacing the original S. From this, the definition is,
P(B | A) = P(A ∩ B) / P(A)
OR
• The notation P(B | A) is read "the probability of event B given event A". It is the probability of an
event B given the occurrence of the event A.
• We say that, the probability that both A and B occur is equal to the probability that A occurs times
the probability that B occurs given that A has occurred. We call P(B | A) the conditional probability of
B given A, i.e., the probability that B will occur given that A has occurred.
P(A | B) = P(A ∩ B) / P(B)
• The probability P(A | B) simply reflects the fact that the probability of an event A may depend on a second event B. If A and B are mutually exclusive, then A ∩ B = ∅ and P(A | B) = 0.
Joint Probability
• A joint probability is a probability that measures the likelihood that two or more events will happen
concurrently.
• If there are two independent events A and B, the probability that A and B will occur is found by multiplying the two probabilities. Thus for two independent events A and B, the special rule of multiplication, shown symbolically, is:
P(A ∩ B) = P(A) × P(B)
• The general rule of multiplication is used to find the joint probability that two events will occur. Symbolically, the general rule of multiplication is:
P(A ∩ B) = P(A) × P(B | A)
• The probability P(A ∩ B) is called the joint probability for two events A and B which intersect in the sample space. A Venn diagram readily shows that
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
Equivalently:
P(A ∩ B) = P(A) + P(B) − P(A ∪ B)
• The probability of the union of two events never exceeds the sum of the event probabilities.
• A tree diagram is very useful for portraying conditional and joint probabilities. A tree diagram
portrays outcomes that are mutually exclusive.
Bayes Theorem
• Bayes' theorem is a method to revise the probability of an event given additional information.
Bayes's theorem calculates a conditional probability called a posterior or revised probability.
• Bayes' theorem is a result in probability theory that relates conditional probabilities. If A and B
denote two events, P(A | B) denotes the conditional probability of A occurring, given that B occurs.
The two conditional probabilities P(A | B) and P(B | A) are in general different.
• Bayes theorem gives a relation between P(A | B) and P(B | A). An important application of Bayes'
theorem is that it gives a rule how to update or revise the strengths of evidence-based beliefs in light
of new evidence a posterior.
• A prior probability is an initial probability value originally obtained before any additional
information is obtained.
• A posterior probability is a probability value that has been revised by using additional information
that is later obtained.
• Suppose that B1, B2, B3, ..., Bn partition the outcomes of an experiment and that A is another event. For any number k, with 1 ≤ k ≤ n, we have the formula:
P(Bk | A) = P(Bk) P(A | Bk) / [ P(B1) P(A | B1) + P(B2) P(A | B2) + ... + P(Bn) P(A | Bn) ]
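A tiny worked example of the formula above in Python, using made-up numbers for a diagnostic test (1% prevalence, 99% sensitivity, 5% false-positive rate).

```python
# Bayes' theorem example: posterior probability of a condition given a positive test.
prior = 0.01               # P(B1): has the condition
sensitivity = 0.99         # P(A | B1): positive test given the condition
false_positive = 0.05      # P(A | B2): positive test given no condition

# Denominator: total probability of a positive test over the partition {B1, B2}
p_positive = sensitivity * prior + false_positive * (1 - prior)

# Posterior P(B1 | A) by Bayes' theorem
posterior = sensitivity * prior / p_positive
print(f"P(condition | positive test) = {posterior:.3f}")   # about 0.167
```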
Support Vector Machine
• Support Vector Machines (SVMs) are a set of supervised learning methods used for data classification. The SVM is a classifier derived from the statistical learning theory of Vapnik and Chervonenkis.
• An SVM is a kind of large-margin classifier: It is a vector space based machine learning method
where the goal is to find a decision boundary between two classes that is maximally far from any
point in the training data
• Given a set of training examples, each marked as belonging to one of two classes, an SVM
algorithm builds a model that predicts whether a new example falls into one class or the other.
Simply speaking, we can think of an SVM model as representing the examples as points in space,
mapped so that each of the examples of the separate classes are divided by a gap that is as wide as
possible.
• New examples are then mapped into the same space and classified to belong to the class based on
which side of the gap they fall on.
• Many decision boundaries can separate these two classes. Which one should we choose?
• Perceptron learning rule can be used to find any decision boundary between class 1 and class 2.
• The line that maximizes the minimum margin is a good bet. The model class of "hyper-planes with
a margin of m" has a low VC dimension if m is big.
• This maximum-margin separator is determined by a subset of the data points. Data points in this
subset are called "support vectors". It will be useful computationally if only a small fraction of the
data points are support vectors, because we use the support vectors to decide which side of the
separator a test case is on.
• SVMs are primarily two-class classifiers with the distinct characteristic that they aim to find the
optimal hyperplane such that the expected generalization error is minimized. Instead of directly
minimizing the empirical risk calculated from the training data, SVMs perform structural risk
minimization to achieve good generalization.
• The empirical risk is the average loss of an estimator over a finite set of data drawn from P. The idea of risk minimization is not only to measure the performance of an estimator by its risk, but to actually search for the estimator that minimizes risk over the distribution P. Because we do not know the distribution P, we instead minimize empirical risk over a training dataset drawn from P. This general learning technique is called empirical risk minimization.
• The decision boundary should be as far away from the data of both classes as possible. If data
points lie very close to the boundary, the classifier may be consistent but is more "likely" to make
errors on new instances from the distribution. Hence, we prefer classifiers that maximize the minimal
distance of data points to the separator.
1. Margin (m): the gap between the data points and the classifier boundary. The margin is the minimum distance of any sample to the decision boundary. If the hyperplane is in canonical form, the margin can be measured by the length of the weight vector. The margin is given by the projection of the distance between two such points on the direction perpendicular to the hyperplane.
Example 8.6.1 For the following figure find a linear hyperplane (decision boundary) that will separate
the data.
Solution:
2. Extend the above definition to non-linearly separable problems by adding a penalty term for misclassifications.
3. Map the data to a high-dimensional space where it is easier to classify with linear decision surfaces: reformulate the problem so that the data is mapped implicitly to this space.
1. Use a single hyperplane which subdivides the space into two half-spaces, one which is occupied by
Class 1 and the other by Class 2
2. They maximize the margin of the decision boundary using quadratic optimization techniques
which find the optimal hyperplane.
5. When used in practice, SVM approaches frequently map the examples to a higher dimensional
space and find margin maximal hyperplanes in the mapped space, obtaining decision boundaries
which are not hyperplanes in the original space.
6. The most popular versions of SVMs use non-linear kernel functions and map the attribute space
into a higher dimensional space to facilitate finding "good" linear decision boundaries in the
modified space.
SVM Applications
2. Image classification
Limitations of SVM
1. It is sensitive to noise.
4. The optimal design for multiclass SVM classifiers is also a research area.
• For the very high dimensional problems common in text classification, sometimes the data are linearly separable. But in the general case they are not, and even if they are, we might prefer a solution that better separates the bulk of the data while ignoring a few weird noise documents.
• What if the training set is not linearly separable? Slack variables can be added to allow misclassification of difficult or noisy examples; the resulting margin is called a soft margin.
• A soft-margin allows a few variables to cross into the margin or over the hyperplane, allowing
misclassification.
• We penalize the crossover by looking at the number and distance of the misclassifications. This is a
trade off between the hyperplane violations and the margin size. The slack variables are bounded by
some set cost. The farther they are from the soft margin, the less influence they have on the
prediction.
2. Slack variable > 0: the point lies within the margin or on the wrong side of the hyperplane.
3. C is the trade-off between the slack variable penalty and the margin.
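A hedged sketch of a soft-margin SVM with scikit-learn, showing how the C parameter trades off margin width against slack-variable penalties; the blob data and the two C values are illustrative.

```python
# Soft-margin SVM sketch: a small C tolerates more margin violations (wider margin),
# a large C penalizes violations heavily (narrower margin).
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)

soft = SVC(kernel="linear", C=0.1)     # small C: more slack allowed
hard = SVC(kernel="linear", C=100.0)   # large C: fewer violations tolerated
soft.fit(X, y)
hard.fit(X, y)

print("support vectors (C=0.1):", soft.n_support_.sum())
print("support vectors (C=100):", hard.n_support_.sum())
```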
Example 8.6.2 From the following diagram, identify which data points (1, 2, 3, 4, 5) are support
vectors (if any), slack variables on correct side of classifier (if any) and slack variables on wrong side
of classifier (if any). Mention which point will have maximum penalty and why?
Solution:
• Maximal margin classifier: A classifier in the family F that maximizes the margin. Maximizing the
margin is good according to intuition and PAC theory. Implies that only support vectors matter; other
training examples are ignorable.
Decision Tree
• A decision tree is a simple representation for classifying examples. Decision tree learning is one of
the most successful techniques for supervised classification learning.
• In decision analysis, a decision tree can be used to visually and explicitly represent decisions and
decision making. As the name goes, it uses a tree-like model of decisions.
• Learned trees can also be represented as sets of if-then rules to improve human readability.
1. Each leaf node has a class label, determined by majority vote of training examples reaching that
leaf.
2. Each internal node is a question on features. It branches out according to the answers.
• Decision tree learning is a method for approximating discrete-valued target functions. The learned
function is represented by a decision tree.
• A learned decision tree can also be re-represented as a set of if-then rules. Decision tree learning is
one of the most widely used and practical methods for inductive inference.
• Goal: Build a decision tree for classifying examples as positive or negative instances of a concept
• Supervised learning, batch processing of training examples, using a preference bias.
C. Each arc has associated with it one of the possible values of the attribute at the node from which
the arc is directed.
• Internal node denotes a test on an attribute. Branch represents an outcome of the test. Leaf nodes
represent class labels or class distribution.
• A decision tree is a flow-chart-like tree structure, where each node denotes a test on an attribute
value, each branch represents an outcome of the test, and tree leaves represent classes or class
distributions. Decision trees can easily be converted to classification rules.
Input:
2. Attribute list
Algorithm:
4. If attribute list is empty then return N as a leaf node labeled with the majority class in D
5. Apply attribute selection method (D, attribute list) to find the "best" splitting criterion;
11. If Dj is empty then attach a leaf labeled with the majority class in D to node N;
12. Else attach the node returned by Generate decision tree (Dj, attribute list) to node N;
14. Return N;
• Decision tree generation consists of two phases: Tree construction and pruning
• In tree construction phase, all the training examples are at the root. Partition examples recursively
based on selected attributes.
• The tree pruning phase identifies and removes branches that reflect noise or outliers.
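A brief sketch of decision tree construction and pruning with scikit-learn; the ccp_alpha value for cost-complexity pruning and the iris dataset are illustrative assumptions.

```python
# Decision tree sketch: grow the tree, then prune it via cost-complexity pruning.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)   # grow, then prune
tree.fit(X, y)

# Each internal node tests an attribute; each leaf carries a class label.
print(export_text(tree))
print("training accuracy:", round(tree.score(X, y), 3))
```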
• There are various paradigms that are used for learning binary classifiers which include:
1. Decision Trees
2. Neural Networks
3. Bayesian Classification
Example 8.7.1 Using following feature tree, write decision rules for majority class.
Solution: Left Side: A feature tree combining two Boolean features. Each internal node or split is
labelled with a feature, and each edge emanating from a split is labelled with a feature value. Each
leaf therefore corresponds to a unique combination of feature values. Also indicated in each leaf is
the class distribution derived from the training set
• Right Side: A feature tree partitions the instance space into rectangular regions, one for each leaf.
• The leaves of the tree in the above figure could be labelled, from left to right, as ham - spam -
spam, employing a simple decision rule called majority class.
• Left side: A feature tree with training set class distribution in the leaves.
• Right side: A decision tree obtained using the majority class decision rule.
• Decision tree learning is generally best suited to problems with the following characteristics:
1. Instances are represented by attribute-value pairs. Fixed set of attributes, and the attributes take a
small number of disjoint possible values.
2. The target function has discrete output values. Decision tree learning is appropriate for a boolean
classification, but it easily extends to learning functions with more than two possible output values.
4. The training data may contain errors. Decision tree learning methods are robust to errors, both
errors in classifications of the training examples and errors in the attribute values that describe these
examples.
5. The training data may contain missing attribute values. Decision tree methods can be used even
when some training examples have unknown values.
6. Decision tree learning has been applied to problems such as learning to classify.
Advantages and Disadvantages of Decision Tree
Advantages:
3. Decision trees are capable of handling datasets that may have errors.
4. Decision trees are capable of handling datasets that may have missing values.
Disadvantages:
1. Most of the algorithms require that the target attribute will have only discrete values.
3. Decision trees are less appropriate for estimation tasks where the goal is to predict the value of a
continuous attribute.
4. Decision trees are prone to errors in classification problems with many classes and a relatively small number of training examples.
Random Forests
• Random forest is a well-known machine learning algorithm that belongs to the supervised learning category. It can be used for both classification and regression problems in ML. It is based on the concept of ensemble learning, which is a process of combining multiple classifiers to solve a complex problem and to improve the overall performance of the model.
• As the name indicates, "Random forest is a classifier that contains a number of decision trees on various subsets of the given dataset and takes the average to improve the predictive accuracy of that dataset." Instead of relying on one decision tree, the random forest takes the prediction from each tree and, based on the majority vote of predictions, it predicts the final output.
• A greater number of trees in the forest leads to higher accuracy and prevents the problem of overfitting.
• Random forest works in two phases: the first is to create the random forest by combining N decision trees, and the second is to make predictions for each tree created in the first phase.
• The working process can be explained in the steps below:
Step 2: Build the decision trees associated with the selected data points (subsets).
Step 3: Choose the number N of decision trees that we want to build.
• The working of the algorithm can be better understood by the example below:
• Example: Suppose there is a dataset that contains multiple fruit images. This dataset is given to the random forest classifier. The dataset is divided into subsets and given to each decision tree. During the training phase, each decision tree produces a prediction result, and when a new data point occurs, then based on the majority of results, the random forest classifier predicts the final decision.
1. Banking: The banking sector mainly uses this algorithm for the identification of loan risk.
2. Medicine: With the help of this algorithm, disease trends and risks of the disease can be identified.
3. Land use: We can identify areas of similar land use with the help of this algorithm.
• Although random forest can be used for both classification and regression tasks, it is not as well suited to regression tasks.
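A short sketch of a random forest classifier with scikit-learn, in which N decision trees are built on bootstrap subsets and their majority vote gives the final prediction; the dataset and n_estimators value are illustrative.

```python
# Random forest sketch: an ensemble of decision trees combined by majority vote.
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)   # N = 100 trees
forest.fit(X_train, y_train)

print("test accuracy:", round(forest.score(X_test, y_test), 3))
```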
Ans.: Learning is a phenomenon and process which has manifestations of various aspects. Learning
process includes gaining of new symbolic knowledge and development of cognitive skills through
instruction and practice. It is also discovery of new facts and theories through observation and
experiment.
Ans.: A computer program is said to learn from experience E with respect to some class of tasks T
and performance measure P, if its performance at tasks in T, as measured by P, improves with
experience E.
Ans.: Information theory is measures of entropy and information content. Minimum description
length approaches to learning. Optimal codes and their relationship to optimal training sequences for
encoding a hypothesis.
Ans.: Target function is a method for solving a problem that an AI algorithm parses its training data
to find. Once an algorithm finds its target function, that function can be used to predict results. The
function can then be used to find output data related to inputs for real problems where, unlike
training sets, outputs are not included.
Ans.: One useful perspective on machine learning is that it involves searching a very large space of
possible hypotheses to determine one that best fits the observed data and any prior knowledge held
by the learner.
• When and how prior knowledge can guide the learning process?
• What is the best way to reduce the learning task to one or more function approximation problems?
• How can the learner automatically alter its representation to improve its learning ability?
Q.7 What is decision tree?
Ans.:
• Decision tree learning is a method for approximating discrete-valued target functions, in which the
learned function is represented by a decision tree.
• A decision tree is a tree where each node represents a feature (attribute), each link (branch) represents a decision (rule) and each leaf represents an outcome (categorical or continuous value).
• A decision tree or a classification tree is a tree in which each internal node is labeled with an input
feature. The arcs coming from a node labeled with a feature are labeled with each of the possible
values of the feature.
1. Each leaf node has a class label, determined by majority vote of training examples reaching that
leaf.
2. Each internal node is a question on features. It branches out according to the answers.
• Decision tree learning is a method for approximating discrete-valued target functions. The learned
function is represented by a decision tree
Ans.: When a decision tree is built, many of the branches will reflect anomalies in the training data
due to noise or outliers. Tree pruning methods address this problem of overfitting the data. Such
methods typically use statistical measures to remove the least reliable branches.
Ans.: Tree pruning attempts to identify and remove such branches, with the goal of improving
classification accuracy on unseen data
Ans.:
4. Sort the pruned rules by their estimated accuracy and consider them in this sequence when
classifying unseen instances
Ans.:
• Converting to rules allows distinguishing among the different contexts in which a decision node is
used.
• Converting to rules removes the distinction between attribute tests that occur near the root of the
tree and those that occur near the leaves.
• Converting to rules improves readability. Rules are often easier for people to understand.
Ans.: Least squares is a statistical method used to determine a line of best fit by minimizing the sum
of squares created by a mathematical function. A "square" is determined by squaring the distance
between a data point and the regression line or mean value of the data set
Ans.: LDA is a supervised learning algorithm, which means that it requires a labelled training set of
data points in order to learn the Linear Discriminant function.
Ans.: Support vectors are data points that are closer to the hyperplane and influence the position
and orientation of the hyperplane. Using these support vectors, we maximize the margin of the
classifier.
Ans.: A Support Vector Machine (SVM) is a supervised machine learning model that uses
classification algorithms for two-group classification problems. After giving an SVM model sets of
labeled training data for each category, they're able to categorize new text.
Ans.: Logistic regression is a supervised learning technique. It is used for predicting the categorical dependent variable using a given set of independent variables.
Ans.: The types of machine learning are supervised, semi-supervised, unsupervised and reinforcement learning.
Ans.: Random forest is an ensemble learning technique that combines multiple decision trees,
implementing the bagging method and results in a robust model with low variance.
Ans.: Popular algorithms are Decision Trees, Neural Networks (back propagation), Probabilistic
networks, Nearest Neighbor and Support vector machines.
Ans.: Function of 'Supervised Learning' are Classifications, Speech recognition, Regression, Predict
time series and Annotate strings.
Ans.: Regression is a method to determine the statistical relationship between a dependent variable
and one or more independent variables.
Ans.: In linear regression models, the dependence of the response on the regressors is defined by a
linear function, which makes their statistical analysis mathematically tractable. On the other hand, in
nonlinear regression models, this dependence is defined by a nonlinear function, hence the
mathematical difficulty in their analysis.
Ans.: Regression analysis is a form of predictive modelling technique which investigates the
relationship between a dependent (target) and independent variable (s) (predictor). This technique is
used for forecasting, time series modelling and finding the causal effect relationship between the
variables.
Ans.:
1. The dependent variable in logistic regression follows a Bernoulli distribution.
2. Estimation is done through maximum likelihood.
Ans.: The goal of logistic regression is to correctly predict the category of outcome for individual
cases using the most parsimonious model. To accomplish this goal, a model is created that includes
all predictor variables that are useful in predicting the response variable.
Ans.: Supervised learning in which the network is trained by providing it with input and matching
output patterns. These input-output pairs are usually provided by an external teacher.