
UNIT III

Chapter 8: Supervised Learning


Syllabus

Introduction to machine learning - Linear Regression Models: Least squares, single & multiple
variables, Bayesian linear regression, gradient descent, Linear Classification Models: Discriminant
function - Probabilistic discriminative model - Logistic regression, Probabilistic generative model -
Naive Bayes, Maximum margin classifier - Support vector machine, Decision Tree, Random forests.

Introduction to Machine Learning

• Machine Learning (ML) is a sub-field of Artificial Intelligence (AI) concerned with developing computational theories of learning and with building learning machines.

• Learning is a process with many manifestations. It includes gaining new symbolic knowledge and developing cognitive skills through instruction and practice, as well as discovering new facts and theories through observation and experiment.

• Machine Learning Definition: A computer program is said to learn from experience E with respect
to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by
P, improves with experience E.

• Machine learning is programming computers to optimize a performance criterion using example data or past experience. The application of machine learning methods to large databases is called data mining.

• It is very hard to write programs that solve problems like recognizing a human face. We do not
know what program to write because we don't know how our brain does it. Instead of writing a
program by hand, it is possible to collect lots of examples that specify the correct output for a given
input.

• A machine learning algorithm then takes these examples and produces a program that does the
job. The program produced by the learning algorithm may look very different from a typical hand-
written program. It may contain millions of numbers. If we do it right, the program works for new
cases as well as the ones we trained it on.

• Main goal of machine learning is to devise learning algorithms that do the learning automatically
without human intervention or assistance. The machine learning paradigm can be viewed as
"programming by example." Another goal is to develop computational models of human learning
process and perform computer simulations.

• The goal of machine learning is to build computer systems that can adapt and learn from their
experience.

• An algorithm is used to solve a problem on a computer. An algorithm is a sequence of instructions that is carried out to transform the input into the output. For example, the addition of four numbers is carried out by giving the four numbers as input to the algorithm; the output is their sum. For the same task there may be several algorithms, and we are interested in finding the most efficient one, requiring the least number of instructions, the least memory, or both.
• For some tasks, however, we do not have an algorithm.

How Machines Learn?

Machine learning typically follows three phases:

1. Training: A training set of examples of correct behavior is analyzed and some representation of the newly learnt knowledge is stored, often in the form of rules.

2. Validation: The rules are checked and, if necessary, additional training is given. Sometimes additional test data are used; alternatively, a human expert or some other automatic knowledge-based component may validate the rules. The role of the tester is often called the opponent.

3. Application: The rules are used in responding to some new situation.

• Fig. 8.1.1 shows phases of ML.

Why Machine Learning is Important?

• Machine learning algorithms can figure out how to perform important tasks by generalizing from
examples.

• Machine learning provides business insight and intelligence. Decision makers are provided with
greater insights into their organizations. This adaptive technology is being used by global enterprises
to gain a competitive edge.

• Machine learning algorithms discover the relationships between the variables of a system (input,
output and hidden) from direct samples of the system.

• Following are some of the reasons:

1) Some tasks cannot be defined well, except by examples. For example: Recognizing people.
2) Relationships and correlations can be hidden within large amounts of data; machine learning and data mining techniques may be able to uncover these relationships.

3) Human designers often produce machines that do not work as well as desired in the environments
in which they are used.

4) The amount of knowledge available about certain tasks might be too large for explicit encoding by
humans.

5) Environments change time to time.

6) New knowledge about tasks is constantly being discovered by humans.

• Machine learning also helps us find solutions of many problems in computer vision, speech
recognition and robotics. Machine learning uses the theory of statistics in building mathematical
models, because the core task is making inference from a sample.

• Learning is used when:

1. Human expertise does not exist (navigating on Mars),

2. Humans are unable to explain their expertise (speech recognition)

3. Solution changes in time (routing on a computer network)

4. Solution needs to be adapted to particular cases (user biometrics)

Ingredients of Machine Learning

The ingredients of machine learning are as follows:

1. Tasks: The problems that can be solved with machine learning. A task is an abstract representation
of a problem. The standard methodology in machine learning is to learn one task at a time. Large
problems are broken into small, reasonably independent sub-problems that are learned separately
and then recombined.

• Predictive tasks perform inference on the current data in order to make predictions. Descriptive
tasks characterize the general properties of the data in the database.

2. Models: The output of machine learning. Different models are geometric models, probabilistic
models, logical models, grouping and grading.

• The model-based approach seeks to create a solution tailored to each new application. Instead of having to transform your problem to fit some standard algorithm, in model-based machine learning you design the algorithm precisely to fit your problem.

• A model is just a set of assumptions, expressed in a precise mathematical form. These assumptions include the number and types of variables in the problem domain, which variables affect each other, and what the effect of changing one variable is on another variable.

• Machine learning models are classified as: Geometric model, Probabilistic model and Logical
model.

3. Features: The workhorses of machine learning. A good feature representation is central to achieving high performance in any machine learning task.

• Feature extraction starts from an initial set of measured data and builds derived values intended to be informative and non-redundant, facilitating the subsequent learning and generalization steps.

• Feature selection is a process that chooses a subset of features from the original features so that
the feature space is optimally reduced according to a certain criterion.

Types of Learning

• Learning is essential for unknown environments, i.e., when the designer lacks omniscience. Learning simply means incorporating information from the training examples into the system.

• Learning is any change in a system that allows it to perform better the second time on repetition of
the same task or on another task drawn from the same population. One part of learning is acquiring
knowledge and new information; and the other part is problem-solving.

• Supervised and Unsupervised Learning are the different types of machine learning methods. A
computational learning model should be clear about the following aspects:

1. Learner: Who or what is doing the learning. For example: Program or algorithm.

2. Domain: What is being learned?

3. Goal: Why the learning is done?

4. Representation: The way the objects to be learned are represented.

5. Algorithmic technology: The algorithmic framework to be used.

6. Information source: The information (training data) the program uses for learning.

7. Training scenario: The description of the learning process.

• Learning is constructing or modifying representations of what is being experienced. To learn means to gain knowledge through study, experience or being taught.

• Machine learning is a scientific discipline concerned with the design and development of algorithms that allow computers to evolve behaviors based on empirical data, such as data from sensors or databases.

• Machine learning is usually divided into two main types: Supervised Learning and Unsupervised
Learning.

Why do Machine Learning?

1. To understand and improve efficiency of human learning.

2. Discover new things or structure that is unknown to humans (Example: Data mining).

3. Fill in skeletal or incomplete specifications about a domain.

Supervised Learning

• Supervised learning is the machine learning task of inferring a function from supervised training
data. The training data consist of a set of training examples. The task of the supervised learner is to
predict the output behavior of a system for any set of input values, after an initial training phase.

• In supervised learning the network is trained by providing it with input patterns and matching output patterns. These input-output pairs are usually provided by an external teacher.
• Human learning is based on the past experiences. A computer does not have experiences.

• A computer system learns from data, which represent some "past experiences" of an application
domain.

• The goal is to learn a target function that can be used to predict the values of a discrete class attribute, e.g., approved or not-approved, high-risk or low-risk. The task is commonly called supervised learning, classification or inductive learning.

• Training data includes both the input and the desired results. For some examples the correct
results (targets) are known and are given in input to the model during the learning process. The
construction of a proper training, validation and test set is crucial. These methods are usually fast
and accurate.

• The model has to be able to generalize: it should give correct results when new data are given as input without the target being known a priori.

• Supervised learning is the machine learning task of inferring a function from supervised training
data. The training data consist of a set of training examples. In supervised learning, each example is a
pair consisting of an input object and a desired output value.

• A supervised learning algorithm analyzes the training data and produces an inferred function,
which is called a classifier or a regression function. Fig. 8.2.1 shows supervised learning process.

• The learned model helps the system to perform task better as compared to no learning.

• Each input vector requires a corresponding target vector.

Training Pair = (Input Vector, Target Vector)

• Fig. 8.2.2 shows input vector.


• Supervised learning denotes a method in which some input vectors are collected and presented to the network. The output computed by the network is observed and the deviation from the expected answer is measured. The weights are corrected according to the magnitude of the error in the way defined by the learning algorithm.

• Supervised learning is further divided into methods which use reinforcement or error correction.
The perceptron learning algorithm is an example of supervised learning with reinforcement.

In order to solve a given problem of supervised learning, the following steps are performed:

1. Find out the type of training examples.

2. Collect a training set.

3. Determine the input feature representation of the learned function.

4. Determine the structure of the learned function and corresponding learning algorithm.

5. Complete the design and then run the learning algorithm on the collected training set.

6. Evaluate the accuracy of the learned function. After parameter adjustment and learning, the
performance of the resulting function should be measured on a test set that is separate from the
training set.
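• As a rough end-to-end illustration of these steps, the hypothetical sketch below uses scikit-learn with synthetic data; the library, dataset and model choice are assumptions for illustration, not part of the text.

# Supervised learning workflow sketch: collect examples, train, then evaluate on a
# separate test set. scikit-learn and the synthetic data are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Steps 1-2: gather training examples, each an (input vector, target) pair
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Steps 3-5: choose a feature representation and a learning algorithm, then train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

# Step 6: evaluate the learned function on a test set separate from the training set
print("test accuracy:", model.score(X_test, y_test))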

8.2.2 Unsupervised Learning

• The model is not provided with the correct results during training. It can be used to cluster the input data into classes on the basis of their statistical properties only. The significance of the clusters must then be assessed and labels assigned.

• The labeling can be carried out even if the labels are only available for a small number of objects
representative of the desired classes. All similar inputs patterns are grouped together as clusters.

• If matching pattern is not found, a new cluster is formed. There is no error feedback.

• No external teacher is used and learning is based only upon local information. It is also referred to as self-organization.

• They are called unsupervised because they do not need a teacher or supervisor to label a set of training examples. Only the original data is required to start the analysis.
• In contrast to supervised learning, unsupervised or self-organized learning does not require an external teacher. During the training session, the neural network receives a number of different input patterns, discovers significant features in these patterns and learns how to classify input data into appropriate categories.

• Unsupervised learning algorithms aim to learn rapidly and can be used in real-time. Unsupervised
learning is frequently employed for data clustering, feature extraction etc.

• Another mode of learning, called recording learning by Zurada, is typically employed for associative memory networks. An associative memory network is designed by recording several ideal patterns into the network's stable states.

Difference between Supervised and Unsupervised Learning

Semi-supervised Learning

• Semi-supervised learning uses both labeled and unlabeled data to improve supervised learning.
The goal is to learn a predictor that predicts future test data better than the predictor learned from
the labeled training data alone.

• Semi-supervised learning is motivated by its practical value in learning faster, better and cheaper.

In many real world applications, it is relatively easy to acquire a large amount of unlabeled data x.

• For example, documents can be crawled from the Web, images can be obtained from surveillance cameras, and speech can be collected from broadcasts. However, their corresponding labels y for the prediction task, such as sentiment orientation, intrusion detection and phonetic transcripts, often require slow human annotation and expensive laboratory experiments.
• In many practical learning domains, there is a large supply of unlabeled data but limited labeled
data, which can be expensive to generate. For example: text processing, video-indexing,
bioinformatics etc.

• Semi-supervised Learning makes use of both labeled and unlabeled data for training, typically a
small amount of labeled data with a large amount of unlabeled data. When unlabeled data is used in
conjunction with a small amount of labeled data, it can produce considerable improvement in
learning accuracy.

• Semi-supervised learning sometimes enables predictive model testing at reduced cost.

• Semi-supervised classification: Training on labeled data exploits additional unlabeled data, frequently resulting in a more accurate classifier.

• Semi-supervised clustering: Uses small amount of labeled data to aid and bias the clustering of
unlabeled data.

Reinforcement Learning

• In supervised learning the learner gets immediate feedback, and in unsupervised learning no feedback at all. In reinforcement learning, the feedback is a delayed scalar reward.

• Reinforcement learning is learning what to do and how to map situations to actions. The learner is not told which actions to take. Fig. 8.2.3 shows the concept of reinforcement learning.

• Reinforcement learning deals with agents that must sense and act upon their environment. It combines classical Artificial Intelligence and machine learning techniques.

• It allows machines and software agents to automatically determine the ideal behavior within a
specific context, in order to maximize its performance. Simple reward feedback is required for the
agent to learn its behavior; this is known as the reinforcement signal.

• The two most important distinguishing features of reinforcement learning are trial-and-error search and delayed reward.

• With reinforcement learning algorithms an agent can improve its performance by using the feedback it gets from the environment. This environmental feedback is called the reward signal.

• Based on accumulated experience, the agent needs to learn which action to take in a given situation in order to achieve a desired long-term goal. Essentially, actions that lead to long-term rewards need to be reinforced. Reinforcement learning has connections with control theory, Markov decision processes and game theory.

• Example of reinforcement learning: A mobile robot decides whether it should enter a new room in search of more trash to collect or start trying to find its way back to its battery recharging station. It makes its decision based on how quickly and easily it has been able to find the recharger in the past.

1. Elements of Reinforcement Learning

• Reinforcement learning elements are as follows:

1. Policy

2. Reward function

3. Value function

4. Model of the environment

• Fig. 8.2.4 shows elements of RL.

• Policy: A policy defines the learning agent's behavior at a given time. It is a mapping from perceived states of the environment to the actions to be taken when in those states.

• Reward Function: Reward function is used to define a goal in a reinforcement learning problem. It
also maps each perceived state of the environment to a single number.

• Value function: Value functions specify what is good in the long run. The value of a state is the total
amount of reward an agent can expect to accumulate over the future, starting from that state.

• Model of the environment: Models are used for planning.

• Credit assignment problem: Reinforcement learning algorithms learn to generate an internal value
for the intermediate states as to how good they are in leading to the goal.

• The learning decision maker is called the agent. The agent interacts with the environment that
includes everything outside the agent.
• The agent has sensors to decide on its state in the environment and takes an action that modifies
its state.

• The reinforcement learning problem is modelled as an agent continuously interacting with an environment. The agent and the environment interact in a sequence of time steps. At each time step t, the agent receives the state of the environment and a scalar numerical reward for the previous action, and then selects an action.

• Reinforcement Learning is a technique for solving Markov decision problems.

• Reinforcement learning uses a formal framework defining the interaction between a learning agent
and its environment in terms of states, actions, and rewards. This framework is intended to be a
simple way of representing essential features of the artificial intelligence problem.
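• As a concrete (and entirely hypothetical) illustration of this framework, the sketch below runs tabular Q-learning on a tiny chain-shaped environment; the environment, constants and variable names are assumptions made for illustration, not an algorithm described in the text.

# Tabular Q-learning sketch for a tiny chain world, illustrating states, actions,
# delayed reward and a learned policy. All details here are illustrative assumptions.
import random

n_states = 5            # states 0..4; reaching state 4 yields the only reward
actions = [-1, +1]      # move left or move right
alpha, gamma, eps = 0.5, 0.9, 0.1
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}

for _ in range(500):
    s = 0
    while s != n_states - 1:
        # Policy: epsilon-greedy mapping from the current state to an action
        if random.random() < eps:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda b: Q[(s, b)])
        s_next = min(max(s + a, 0), n_states - 1)
        r = 1.0 if s_next == n_states - 1 else 0.0      # reward function
        # Value update: propagate the delayed reward back to earlier states
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s_next, b)] for b in actions) - Q[(s, a)])
        s = s_next

print("preferred action in state 0:", max(actions, key=lambda b: Q[(0, b)]))   # expected: +1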

Difference between Supervised, Unsupervised and Reinforcement Learning

Regression

• Regression finds correlations between dependent and independent variables. If the desired output consists of one or more continuous variables, then the task is called regression.

• Therefore, regression algorithms help predict continuous variables such as house prices, market
trends, weather patterns, oil and gas prices etc.

• Fig. 8.3.1 shows regression.


• When the targets in a dataset are real numbers, the machine learning task is known as regression
and each sample in the dataset has a real-valued output or target.

• Regression analysis is a set of statistical methods used for the estimation of relationships between
a dependent variable and one or more independent variables. It can be utilized to assess the
strength of the relationship between variables and for modelling the future relationship between
them.

• The two basic types of regression are linear regression and multiple linear regression.

8.3.1 Linear Regression Models

• Linear regression is a statistical method that allows us to summarize and study relationships
between two continuous (quantitative) variables.

• The objective of a linear regression model is to find a relationship between the input variables and
a target variable.

1. One variable, denoted x, is regarded as the predictor, explanatory or independent variable.

2. The other variable, denoted y, is regarded as the response, outcome or dependent variable.

• Regression models predict a continuous variable, such as the sales made on a day or the temperature of a city. Imagine that we fit a line to the training points we have. If we want to add another data point, then to fit it we may need to change the existing model.

• This will happen with each data point that we add to the model; hence, linear regression is not well suited to classification problems.

• Regression estimates are used to explain the relationship between one dependent variable and
one or more independent variables. Classification predicts categorical labels (classes), prediction
models continuous - valued functions. Classification is considered to be supervised learning.

• Classifies data based on the training set and the values in a classifying attribute and uses it in
classifying new data. Prediction means models continuous - valued functions, i.e. predicts unknown
or missing values.

• The regression line gives the average relationship between the two variables in mathematical form.

• For two variables X and Y, there are always two lines of regression.

• Regression line of X on Y: Gives the best estimate for the value of X for any specific given value of Y:

X = a + bY

Where a= X - intercept

b = Slope of the line

X = Dependent variable

Y = Independent variable

• Regression line of Y on X: Gives the best estimate for the value of Y for any specific given value of X:

Y = a + bX

Where a = Y - intercept

b = Slope of the line

Y = Dependent variable

X = Independent variable

• By using the least squares method (a procedure that minimizes the vertical deviations of plotted
points surrounding a straight line) we are able to construct a best fitting straight line to the scatter
diagram points and then formulate a regression equation in the form of :

y = a + bx,    or equivalently    ŷ = ȳ + b(x − x̄)

• Regression analysis is the art and science of fitting straight lines to patterns of data. In a linear
regression model, the variable of interest ("dependent" variable) is predicted from k other variables
("independent" variables) using a linear equation. If Y denotes the dependent variable and X1,..., Xk,
are the independent variables, then the assumption is that the value of Y at time t in the data sample
is determined by the linear equation :

Yt = β0 + β1X1t + β2X2t + ... + βkXkt + εt

where the betas are constants and the epsilons are independent and identically distributed normal
random variables with mean zero.

• At each split point, the "error" between the predicted value and the actual values is squared to get
a "Sum of Squared Errors (SSE)". The split point errors across the variables are compared and the
variable/point yielding the lowest SSE is chosen as the root node/split point. This process is
recursively continued.

• The error function measures how much our predictions deviate from the desired answers. The mean-squared error is

Jn = (1/n) Σi=1..n (yi − f(xi))²

Advantages:

a. Training a linear regression model is usually much faster than methods such as neural networks.

b. Linear regression models are simple and require minimum memory to implement.

c. By examining the magnitude and sign of the regression coefficients you can infer how predictor
variables affect the target outcome.

Least squares

• The method of least squares is about estimating parameters by minimizing the squared discrepancies between observed data, on the one hand, and their expected values on the other.

• Consider an arbitrary straight line, y = b0 + b1x, that is to be fitted through these data points. The question is: which line is the most representative?

• What are the values of b0 and b1 such that the resulting line "best" fits the data points? And what goodness-of-fit criterion should we use to decide among all possible combinations of b0 and b1?

• The Least Squares (LS) criterion states that the sum of the squares of the errors is a minimum. The least-squares solution yields y(x) whose elements sum to 1, but it does not ensure the outputs lie in the range [0, 1].

• How do we draw such a line based on the observed data points? Suppose an imaginary line y = a + bx.

• Imagine the vertical distance between the line and a data point, E = Y − E(Y).

• This error is the deviation of the data point from the imaginary (regression) line. Then what are the best values of a and b? They are the a and b that minimize the sum of such errors.
• Raw deviations do not have good properties for computation, so we use squared deviations instead: we seek the a and b that minimize the sum of squared deviations rather than the sum of deviations. This method is called least squares.

• The least squares method minimizes the sum of squares of the errors. Such a and b are called least squares estimators, i.e. estimators of the parameters a and b.

• The process of getting parameter estimators (e.g., a and b) is called estimation. The least squares method is the estimation method of Ordinary Least Squares (OLS).

Disadvantages of least square

1. Lack robustness to outliers

2. Certain datasets unsuitable for least squares classification

3. The decision boundary corresponds to the maximum likelihood solution under a Gaussian noise assumption, which is inappropriate for binary targets

Example 8.3.1 Fit a straight line to the points in the table. Compute m and b by least squares.

Solution: Represent in matrix form:
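• The example's table and worked matrix solution are not reproduced here; the sketch below shows the general matrix-form (normal equations) computation on a hypothetical set of points, which are an assumption for illustration.

# Least squares fit of y = m*x + b in matrix form (normal equations).
# The data points below are hypothetical; the example's original table is not reproduced.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

A = np.column_stack([x, np.ones_like(x)])     # each row is [x_i, 1]
# Solve the normal equations A^T A [m, b]^T = A^T y
m, b = np.linalg.solve(A.T @ A, A.T @ y)
print("m =", m, " b =", b)                    # exact line here: m = 2, b = 1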


Multiple Regression

• Regression analysis is used to predict the value of one or more responses from a set of predictors.
It can also be used to estimate the linear association between the predictors and responses.
Predictors can be continuous or categorical or a mixture ben of both.

• If multiple independent variables affect the response variable, then the analysis calls for a model
different from that used for the single predictor variable. In a situation where more than one
independent factor (variable) affects the outcome of process, a multiple regression model is used.
This is referred to as multiple linear regression model or multivariate least squares fitting.

• Let z1, z2, ..., zr be a set of r predictors believed to be related to a response variable Y. The linear regression model for the jth sample unit has the form

Yj = β0 + β1 zj1 + β2 zj2 + ... + βr zjr + εj

where εj is a random error and the βi, i = 0, 1, ..., r, are unknown regression coefficients.

• With n independent observations, we can write one model for each sample unit so that the model
is now

Y = Zβ + ε

where Y is n × 1, Z is n × (r+1),β is (r+1)× 1 and ε is n × 1

• In order to estimate β, we take a least squares approach that is analogous to what we did in the simple linear regression case.

• In matrix form, stacking the n observations into Y and Z, the least squares estimates are obtained as

β̂ = (ZᵀZ)⁻¹ ZᵀY

where the β̂j are the estimates of the regression coefficients.
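• A short sketch of this computation with synthetic data (NumPy, the data and the coefficient values are assumptions for illustration):

# Multiple linear regression Y = Z beta + eps solved by least squares.
# The synthetic data and true coefficients are assumptions for illustration only.
import numpy as np

rng = np.random.default_rng(0)
n, r = 100, 2
Z = np.column_stack([np.ones(n), rng.normal(size=(n, r))])   # n x (r+1), first column of ones
true_beta = np.array([1.0, 2.0, -3.0])
Y = Z @ true_beta + 0.1 * rng.normal(size=n)

beta_hat = np.linalg.lstsq(Z, Y, rcond=None)[0]              # least squares estimate
print("estimated coefficients:", beta_hat)                   # close to [1, 2, -3]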

Difference between Simple Regression and Multiple Regression

Bayesian Linear Regression

• Bayesian linear regression provides a useful mechanism to deal with insufficient or poorly distributed data. It allows the user to put a prior on the coefficients and on the noise, so that in the absence of data the priors can take over. A prior is a distribution on a parameter.

• If we could flip the coin an infinite number of times, inferring its bias would be easy by the law of
large numbers. However, what if we could only flip the coin a handful of times? Would we guess that
a coin is biased if we saw three heads in three flips, an event that happens one out of eight times
with unbiased coins? The MLE would overfit these data, inferring a coin bias of p =1.

• A Bayesian approach avoids overfitting by quantifying our prior knowledge that most coins are
unbiased, that the prior on the bias parameter is peaked around one-half. The data must overwhelm
this prior belief about coins.

• Bayesian methods allow us to estimate model parameters, to construct model forecasts and to
conduct model comparisons. Bayesian learning algorithms can calculate explicit probabilities for
hypotheses.

• Bayesian classifiers use a simple idea that the training data are utilized to calculate an observed
probability of each class based on feature values.

• When Bayesian classifier is used for unclassified data, it uses the observed probabilities to predict
the most likely class for the new features.

• Each observed training example can incrementally decrease or increase the estimated probability
that a hypothesis is correct.

• Prior knowledge can be combined with observed data to determine the final probability of a
hypothesis. In Bayesian learning, prior knowledge is provided by asserting a prior probability for each
candidate hypotheses and a probability distribution over observed data for each possible hypothesis.
• Bayesian methods can accommodate hypotheses that make probabilistic predictions. New instances can be classified by combining the predictions of multiple hypotheses, weighted by their probabilities.

• Even in cases where Bayesian methods prove computationally intractable, they can provide a standard of optimal decision making against which other practical methods can be measured.

• Uses of Bayesian classifiers are as follows:

1. Used in text-based classification for finding spam or junk mail filtering.

2. Medical diagnosis.

3. Network security such as detecting illegal intrusion.

• The basic procedure for implementing Bayesian Linear Regression is :

i) Specify priors for the model parameter.

ii) Create a model mapping the training inputs to the training outputs.

iii) Have a Markov Chain Monte Carlo (MCMC) algorithm draw samples from the posterior
distributions for the parameters
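• The procedure above uses MCMC; as a rough alternative illustration, the sketch below assumes a conjugate Gaussian prior on the weights with known noise precision, for which the posterior has a closed form. The prior settings and the data are assumptions, not from the text.

# Bayesian linear regression sketch with a conjugate Gaussian prior (closed-form posterior).
# Prior precision (alpha), noise precision (beta) and the data are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=20)
y = 0.5 + 2.0 * x + 0.2 * rng.normal(size=20)     # synthetic targets

Phi = np.column_stack([np.ones_like(x), x])       # design matrix [1, x]
alpha, beta = 2.0, 25.0                           # prior precision, noise precision

# Posterior over weights: N(m_N, S_N) with
#   S_N^{-1} = alpha*I + beta*Phi^T Phi,   m_N = beta * S_N Phi^T y
S_N_inv = alpha * np.eye(2) + beta * Phi.T @ Phi
S_N = np.linalg.inv(S_N_inv)
m_N = beta * S_N @ Phi.T @ y

print("posterior mean of [intercept, slope]:", m_N)
# Predictive mean at a new input x* is m_N . [1, x*]; the prior keeps the
# estimates well behaved even when the data are scarce.
x_new = 0.3
print("predictive mean at x* = 0.3:", m_N @ np.array([1.0, x_new]))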

Gradient Descent

• Goal: Solving nonlinear minimization problems using derivative information.

• First and second derivatives of the objective function or the constraints play an important role in
optimization. The first order derivatives are called the gradient and the second order derivatives are
called the Hessian matrix.

• Derivative-based optimization is also called nonlinear optimization. It is capable of determining search directions according to an objective function's derivative information.

• Derivative based optimization methods are used for:

1. Optimization of nonlinear neuro-fuzzy models

2. Neural network learning

3. Regression analysis in nonlinear models

• Basic descent methods are as follows:

1. Steepest descent

2. Newton-Raphson method

Gradient Descent:

• Gradient descent is a first-order optimization algorithm. To find a local minimum of a function using
gradient descent, one takes steps proportional to the negative of the gradient of the function at the
current point.

• Gradient descent is popular for very large-scale optimization problems because it is easy to
implement, can handle black box functions, and each iteration is cheap.
• Given a differentiable scalar field f(x) and an initial guess x1, gradient descent iteratively moves the guess toward lower values of f by taking steps in the direction of the negative gradient −∇f(x).

• Locally, the negated gradient is the steepest descent direction, i.e., the direction that x would need
to move in order to decrease "f" the fastest. The algorithm typically converges to a local minimum,
but may rarely reach a saddle point, or not move at all if x1 lies at a local maximum.

• The gradient will give the slope of the curve at that x and its direction will point to an increase in
the function. So we change x in the opposite direction to lower the function value:

xk+1 = xk − λ ∇f(xk)

where λ > 0 is a small step size that forces the algorithm to make small jumps.
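• A minimal sketch of this update rule, assuming a simple quadratic objective and a hand-picked step size (both are illustrative assumptions):

# Gradient descent sketch: x_{k+1} = x_k - lambda * grad f(x_k).
# The objective f, its gradient, the step size and the starting point are illustrative choices.
import numpy as np

def f(x):
    return (x[0] - 3.0) ** 2 + 2.0 * (x[1] + 1.0) ** 2     # minimum at (3, -1)

def grad_f(x):
    return np.array([2.0 * (x[0] - 3.0), 4.0 * (x[1] + 1.0)])

x = np.array([0.0, 0.0])      # initial guess x_1
lam = 0.1                     # step size lambda > 0

for k in range(100):
    x = x - lam * grad_f(x)   # step in the direction of the negative gradient

print("approximate minimizer:", x, " f(x) =", f(x))   # x close to (3, -1)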

Limitations of Gradient Descent:

• Gradient descent is relatively slow close to the minimum: technically, its asymptotic rate of convergence is inferior to that of many other methods.

• For poorly conditioned convex problems, gradient descent increasingly 'zigzags' as the gradients point nearly orthogonally to the shortest direction to a minimum point.

Steepest Descent:

• Steepest descent is also known as gradient method.

• This method is based on first order Taylor series approximation of objective function. This method
is also called saddle point method. Fig. 8.3.5 shows steepest descent method.

• The Steepest Descent is the simplest of the gradient methods. The chosen direction is the one in which f decreases most quickly, which is the direction opposite to ∇f(xi). The search starts at an arbitrary point x0 and then goes down the gradient until it reaches close to the solution.
• The method of steepest descent is the discrete analogue of gradient descent, but the best move is
computed using a local minimization rather than computing a gradient. It is typically able to converge
in few steps but it is unable to escape local minima or plateaus in the objective function.

• The gradient is everywhere perpendicular to the contour lines. After each line minimization the new gradient is always orthogonal to the previous step direction. Consequently, the iterates tend to zig-zag down the valley in a very inefficient manner.

• The method of Steepest Descent is simple, easy to apply, and each iteration is fast. It is also very stable; if minimum points exist, the method is guaranteed to locate them, although convergence may require a very large number of iterations.

Linear Classification Models

• A classification algorithm (Classifier) that makes its classification based on a linear predictor
function combining a set of weights with the feature vector.

• A linear classifier makes its classification decision based on the value of a linear combination of the characteristics. One can imagine that the linear classifier merges into its weights all the characteristics that define a particular class.

• Linear classifiers can represent a lot of things, but they can't represent everything. The classic
example of what they can't represent is the XOR function.

Discriminant Function

• Linear Discriminant Analysis (LDA) is the most commonly used dimensionality reduction technique
in supervised learning. Basically, it is a preprocessing step for pattern classification and machine
learning applications. LDA is a powerful algorithm that can be used to determine the best separation
between two or more classes.

• LDA is a supervised learning algorithm, which means that it requires a labelled training set of data
points in order to learn the linear discriminant function.

• The main purpose of LDA is to find the line or plane that best separates data points belonging to different classes. The key idea behind LDA is that the decision boundary should be chosen such that it maximizes the distance between the means of the two classes while simultaneously minimizing the variance within each class (the within-class scatter). This criterion is known as the Fisher criterion.

• LDA is one of the most widely used machine learning algorithms due to its accuracy and flexibility.
LDA can be used for a variety of tasks such as classification, dimensionality reduction, and feature
selection.

• Suppose we have two classes and we need to classify them efficiently, then using LDA, classes are
divided as follows:
• LDA algorithm works based on the following steps:

a) The first step is to calculate the means and standard deviation of each feature.

b) Within class scatter matrix and between class scatter matrix is calculated

c) These matrices are then used to calculate the eigenvectors and eigenvalues.

d) LDA chooses the k eigenvectors with the largest eigenvalues to form a transformation matrix.

e) LDA uses this transformation matrix to transform the data into a new space with k dimensions.

f) Once the transformation matrix transforms the data into the new space with k dimensions, LDA can then be used for classification or dimensionality reduction.
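• A brief sketch of these steps using scikit-learn's LinearDiscriminantAnalysis; the library and the synthetic two-class data are assumptions, not part of the text.

# LDA sketch: fit a linear discriminant on labelled data, then project and classify.
# scikit-learn and the synthetic two-class data are illustrative assumptions.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X0 = rng.normal(loc=[0, 0], scale=1.0, size=(50, 2))    # class 0
X1 = rng.normal(loc=[3, 3], scale=1.0, size=(50, 2))    # class 1
X = np.vstack([X0, X1])
y = np.array([0] * 50 + [1] * 50)

lda = LinearDiscriminantAnalysis(n_components=1)        # k = 1 discriminant direction
X_proj = lda.fit_transform(X, y)                        # dimensionality reduction
print("training accuracy:", lda.score(X, y))            # classification
print("first projected values:", X_proj[:3].ravel())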

• Benefits of using LDA:

a) LDA is used for classification problems.

b) LDA is a powerful tool for dimensionality reduction.

c) LDA is not susceptible to the "curse of dimensionality" like many other machine learning
algorithms.

Logistic Regression

• Logistic regression is a form of regression analysis in which the outcome variable is binary or
dichotomous. A statistical method used to model dichotomous or binary outcomes using predictor
variables.

• Logistic component: Instead of modeling the outcome, Y, directly, the method models the log odds
(Y) using the logistic function.

• Regression component: Methods used to quantify association between an outcome and predictor
variables. It could be used to build predictive models as a function of predictors.

• Simple logistic regression is logistic regression with one predictor variable.

Logistic Regression:

ln[P(Y) / (1 − P(Y))] = β0 + β1X1 + β2X2 + ... + βkXk

Y = β0 + β1X1 + β2X2 + ... + βkXk + ε
• With logistic regression, the response variable is an indicator of some characteristic, that is, a 0/1
variable. Logistic regression is used to determine whether other measurements are related to the
presence of some characteristic, for example, whether certain blood measures are predictive of
having a disease.

• If analysis of covariance can be said to be a t-test adjusted for other variables, then logistic regression can be thought of as a chi-square test for homogeneity of proportions adjusted for other variables. While the response variable in a logistic regression is a 0/1 variable, the logistic regression equation, which is a linear equation, does not predict the 0/1 variable itself.

• Fig. 8.4.2 shows Sigmoid curve for logistic regression.

• The linear and logistic probability models are:

Linear regression:

p = a0 + a1X1 + a2X2 + ... + akXk

Logistic regression:

ln[p / (1 − p)] = b0 + b1X1 + b2X2 + ... + bkXk

• The linear model assumes that the probability p is a linear function of the regressors, while the
logistic model assumes that the natural log of the odds p/(1 - p) is a linear function of the regressors.

• The major advantage of the linear model is its interpretability. In the linear model, if a1 is 0.05, that means that a one-unit increase in X1 is associated with a 5 percentage point increase in the probability that Y is 1.

• The logistic model is less interpretable. In the logistic model, if b1 is 0.05, that means that a one-
unit increase in X1 is associated with a 0.05 increase in the log odds that Y is 1. And what does that
mean? I've never met anyone with any intuition for log odds.
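• A short sketch of fitting and interpreting a logistic regression; scikit-learn and the synthetic data are assumptions made for illustration.

# Logistic regression sketch: model the log odds of a binary outcome as a linear function.
# scikit-learn and the synthetic data below are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
# Assumed true model: log odds = 0.5 + 2*X1 - 1*X2
logit = 0.5 + 2.0 * X[:, 0] - 1.0 * X[:, 1]
y = (rng.uniform(size=200) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

model = LogisticRegression().fit(X, y)
print("intercept b0:", model.intercept_)          # estimate of the log-odds intercept
print("coefficients b1, b2:", model.coef_)        # each bj is a change in log odds per unit of Xj
print("P(Y=1 | x=[1, 0]):", model.predict_proba([[1.0, 0.0]])[0, 1])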

Probabilistic Generative Model

• Generative models are a class of statistical models that generate new data instances. These models
are used in unsupervised machine learning to perform tasks such as probability and likelihood
estimation, modelling data points, and distinguishing between classes using these probabilities.

• Generative models rely on Bayes' theorem to find the joint probability. Generative models describe how data is generated using probabilistic models. They predict P(y | x), the probability of y given x, by calculating P(x, y), the joint probability of x and y.
Naive Bayes

• Naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes'
theorem with strong independence assumptions between the features. It is highly scalable, requiring
a number of parameters linear in the number of variables (features/predictors) in a learning
problem.

• A Naive Bayes Classifier is a program which predicts a class value given a set of attributes.

• For each known class value,

1. Calculate probabilities for each attribute, conditional on the class value.

2. Use the product rule to obtain a joint conditional probability for the attributes.

3. Use Bayes rule to derive conditional probabilities for the class variable.

• Once this has been done for all class values, output the class with the highest probability.

• Naive Bayes simplifies the calculation of probabilities by assuming that the probability of each attribute belonging to a given class value is independent of all other attributes. This is a strong assumption but it results in a fast and effective method.

• The probability of a class value given a value of an attribute is called the conditional probability. By
multiplying the conditional probabilities together for each attribute for a given class value, we have a
probability of a data instance belonging to that class.
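• A compact sketch of the procedure above using Gaussian Naive Bayes from scikit-learn (a hypothetical choice; the text does not specify a particular implementation, and the toy data are assumptions):

# Naive Bayes sketch: per-class, per-attribute probabilities combined by the product rule
# and Bayes' rule. scikit-learn's GaussianNB and the toy data are illustrative assumptions.
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[1.0, 2.1], [1.2, 1.9], [0.9, 2.2],     # class 0 examples
              [3.0, 0.5], [3.2, 0.4], [2.9, 0.6]])    # class 1 examples
y = np.array([0, 0, 0, 1, 1, 1])

clf = GaussianNB().fit(X, y)                  # estimates P(attribute | class) per feature
print("class priors P(class):", clf.class_prior_)
print("predicted class for [1.1, 2.0]:", clf.predict([[1.1, 2.0]]))
print("posterior probabilities:", clf.predict_proba([[1.1, 2.0]]))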

Conditional Probability

• Let A and B be two events such that P(A) > 0. We denote by P(B | A) the probability of B given that A has occurred. Since A is known to have occurred, it becomes the new sample space, replacing the original S. From this, the definition is,

P(B | A) = P(A ∩ B) / P(A)

or equivalently

P(A ∩ B) = P(A) P(B | A)

• The notation P(B | A) is read "the probability of event B given event A". It is the probability of an
event B given the occurrence of the event A.

• We say that, the probability that both A and B occur is equal to the probability that A occurs times
the probability that B occurs given that A has occurred. We call P(B | A) the conditional probability of
B given A, i.e., the probability that B will occur given that A has occurred.

• Similarly, the conditional probability of an event A given B is

P(A | B) = P(A ∩ B) / P(B)

• The probability P(A | B) simply reflects the fact that the probability of an event A may depend on a second event B. If A and B are mutually exclusive, then A ∩ B = ∅ and P(A | B) = 0.

• Another way to look at the conditional probability formula is :

P(Second/First) = P(First choice and second choice)/P(First choice)

• Conditional probability is a defined quantity and cannot be proven.


• The key to solving conditional probability problems is to:

1. Define the events.

2. Express the given information and question in probability notation.

3. Apply the formula.

Joint Probability

• A joint probability is a probability that measures the likelihood that two or more events will happen
concurrently.

• If there are two independent events A and B, the probability that A and B will occur is found by
multiplying the two probabilities. Thus for two events A and B, the special rule of multiplication
shown symbolically is :

P(A and B) = P(A) P(B).

• The general rule of multiplication is used to find the joint probability that two events will occur.
Symbolically, the general rule of multiplication is,

P(A and B) = P(A) P(B | A).

• The probability P(A ∩ B) is called the joint probability of two events A and B which intersect in the sample space. A Venn diagram readily shows that

P(A ∩ B) = P(A) + P(B) − P(A ∪ B)

Equivalently:

P(A ∪ B) = P(A) + P(B) − P(A ∩ B) ≤ P(A) + P(B)

• The probability of the union of two events never exceeds the sum of the event probabilities.

• A tree diagram is very useful for portraying conditional and joint probabilities. A tree diagram
portrays outcomes that are mutually exclusive.

Bayes Theorem

• Bayes' theorem is a method to revise the probability of an event given additional information.
Bayes's theorem calculates a conditional probability called a posterior or revised probability.

• Bayes' theorem is a result in probability theory that relates conditional probabilities. If A and B
denote two events, P(A | B) denotes the conditional probability of A occurring, given that B occurs.
The two conditional probabilities P(A | B) and P(B | A) are in general different.

• Bayes theorem gives a relation between P(A | B) and P(B | A). An important application of Bayes'
theorem is that it gives a rule how to update or revise the strengths of evidence-based beliefs in light
of new evidence a posterior.

• A prior probability is an initial probability value originally obtained before any additional
information is obtained.

• A posterior probability is a probability value that has been revised by using additional information
that is later obtained.
• Suppose that B1, B2, B3, ..., Bn partition the outcomes of an experiment and that A is another event. For any number k, with 1 ≤ k ≤ n, we have the formula:

P(Bk | A) = P(Bk) P(A | Bk) / [P(B1) P(A | B1) + P(B2) P(A | B2) + ... + P(Bn) P(A | Bn)]
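• A small numeric illustration of this formula; the events and all probability values below are made-up numbers, not taken from the text.

# Bayes' theorem sketch: revise a prior probability in light of new evidence A.
# The events and the numbers (disease prevalence, test accuracy) are made-up values.
priors = {"disease": 0.01, "no_disease": 0.99}               # P(B_k): prior probabilities
likelihoods = {"disease": 0.95, "no_disease": 0.05}          # P(A | B_k), A = positive test

evidence = sum(priors[b] * likelihoods[b] for b in priors)   # P(A), the denominator
posterior = {b: priors[b] * likelihoods[b] / evidence for b in priors}

print("posterior P(disease | positive test):", round(posterior["disease"], 3))  # about 0.161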

Difference between Generative and Discriminative Models

Maximum Margin Classifier: Support Vector Machine

• Support Vector Machines (SVMs) are a set of supervised learning methods which learn from a dataset and are used for classification. SVM is a classifier derived from statistical learning theory by Vapnik and Chervonenkis.

• An SVM is a kind of large-margin classifier: It is a vector space based machine learning method
where the goal is to find a decision boundary between two classes that is maximally far from any
point in the training data
• Given a set of training examples, each marked as belonging to one of two classes, an SVM
algorithm builds a model that predicts whether a new example falls into one class or the other.
Simply speaking, we can think of an SVM model as representing the examples as points in space,
mapped so that each of the examples of the separate classes are divided by a gap that is as wide as
possible.

• New examples are then mapped into the same space and classified to belong to the class based on
which side of the gap they fall on.

Two Class Problems

• Many decision boundaries can separate these two classes. Which one should we choose?

• Perceptron learning rule can be used to find any decision boundary between class 1 and class 2.

• The line that maximizes the minimum margin is a good bet. The model class of "hyper-planes with
a margin of m" has a low VC dimension if m is big.

• This maximum-margin separator is determined by a subset of the data points. Data points in this
subset are called "support vectors". It will be useful computationally if only a small fraction of the
data points are support vectors, because we use the support vectors to decide which side of the
separator a test case is on.

Example of Bad Decision Boundaries

• SVM are primarily two-class classifiers with the distinct characteristic that they aim to find the
optimal hyperplane such that the expected generalization error is minimized. Instead of directly
minimizing the empirical risk calculated from the training data, SVMs perform structural risk
minimization to achieve good generalization.

• The empirical risk is the average loss of an estimator for a finite set of data drawn from P. The idea
of risk minimization is not only measure the performance of an estimator by its risk, but to actually
search for the estimator that minimizes risk over distribution P. Because we don't know distribution P
we instead minimize empirical risk over a training dataset drawn from P. This general learning
technique is called empirical risk minimization.

• Fig. 8.6.3 shows empirical risk.


Good Decision Boundary: Margin Should Be Large

• The decision boundary should be as far away from the data of both classes as possible. If data
points lie very close to the boundary, the classifier may be consistent but is more "likely" to make
errors on new instances from the distribution. Hence, we prefer classifiers that maximize the minimal
distance of data points to the separator.

1. Margin (m): the gap between the data points and the classifier boundary. The margin is the minimum distance of any sample to the decision boundary. If the hyperplane is in canonical form, the margin can be measured using the length of the weight vector. The margin is given by the projection of the distance between these two points onto the direction perpendicular to the hyperplane.

The margin of the separator is the distance between the support vectors:

Margin (m) = 2 / ||w||


2. Maximal margin classifier: a classifier in the family F that maximizes the margin. Maximizing the
margin is good according to intuition and PAC theory. Implies that only support vectors matter; other
training examples are ignorable.
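• A short sketch of a maximal margin classifier using scikit-learn's linear SVC (the library and the toy data are assumptions); the fitted model exposes which training points ended up as support vectors:

# Maximal margin classifier sketch: fit a linear SVM and inspect its support vectors.
# scikit-learn and the small linearly separable dataset are illustrative assumptions.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 1], [1, 2],      # class -1
              [4, 4], [5, 4], [4, 5]])     # class +1
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)    # very large C approximates a hard margin
w = clf.coef_[0]
print("support vectors:", clf.support_vectors_)          # only these points define the boundary
print("margin 2/||w|| =", 2.0 / np.linalg.norm(w))
print("prediction for [3, 3]:", clf.predict([[3, 3]]))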

Example 8.6.1 For the following figure find a linear hyperplane (decision boundary) that will separate
the data.

Solution:

1. Define what an optimal hyperplane is : maximize margin

2. Extend the above definition to non-linearly separable problems: include a penalty term for misclassifications.

3. Map the data to a high-dimensional space where it is easier to classify with linear decision surfaces: reformulate the problem so that the data is mapped implicitly to this space.

Key Properties of Support Vector Machines

1. Use a single hyperplane which subdivides the space into two half-spaces, one which is occupied by
Class 1 and the other by Class 2

2. They maximize the margin of the decision boundary using quadratic optimization techniques
which find the optimal hyperplane.

3. Ability to handle large feature spaces.

4. Overfitting can be controlled by soft margin approach

5. When used in practice, SVM approaches frequently map the examples to a higher dimensional
space and find margin maximal hyperplanes in the mapped space, obtaining decision boundaries
which are not hyperplanes in the original space.

6. The most popular versions of SVMs use non-linear kernel functions and map the attribute space
into a higher dimensional space to facilitate finding "good" linear decision boundaries in the
modified space.

SVM Applications

• SVM has been used successfully in many real-world problems,

1. Text (and hypertext) categorization

2. Image classification

3. Bioinformatics (Protein classification, Cancer classification)

4.Hand-written character recognition

5. Determination of SPAM email.

Limitations of SVM

1. It is sensitive to noise.

2. The biggest limitation of SVM lies in the choice of the kernel.

3. Another limitation is speed and size.

4. The optimal design for multiclass SVM classifiers is also a research area.

Soft Margin SVM

• For the very high dimensional problems common in text classification, sometimes the data are linearly separable. But in the general case they are not, and even if they are, we might prefer a solution that better separates the bulk of the data while ignoring a few weird noise documents.

• What if the training set is not linearly separable? Slack variables can be added to allow misclassification of difficult or noisy examples; the resulting margin is called a soft margin.

• A soft-margin allows a few variables to cross into the margin or over the hyperplane, allowing
misclassification.
• We penalize the crossover by looking at the number and distance of the misclassifications. This is a
trade off between the hyperplane violations and the margin size. The slack variables are bounded by
some set cost. The farther they are from the soft margin, the less influence they have on the
prediction.

• All observations have an associated slack variable:

1. If the slack variable = 0, the point is on the margin.

2. If the slack variable > 0, the point is inside the margin or on the wrong side of the hyperplane.

3. C is the trade-off between the slack variable penalty and the margin.
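• A sketch of the soft-margin trade-off using the C parameter (scikit-learn and the noisy synthetic data are assumptions): a small C tolerates more margin violations, a large C penalizes them heavily.

# Soft-margin SVM sketch: the C parameter trades margin size against slack penalties.
# scikit-learn and the noisy, overlapping synthetic data are illustrative assumptions.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, (50, 2)),
               rng.normal([2, 2], 1.0, (50, 2))])      # overlapping classes, not separable
y = np.array([0] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C}: {clf.support_vectors_.shape[0]} support vectors, "
          f"training accuracy {clf.score(X, y):.2f}")
# Small C -> wider margin, more violations tolerated (more support vectors);
# large C -> narrower margin, fewer violations tolerated.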

Comparison of SVM and Neural Networks

Example 8.6.2 From the following diagram, identify which data points (1, 2, 3, 4, 5) are support
vectors (if any), slack variables on correct side of classifier (if any) and slack variables on wrong side
of classifier (if any). Mention which point will have maximum penalty and why?

Solution:

• Data points 1 and 5 will have maximum penalty.


• Margin (m) is the gap between data points & the classifier boundary. The margin is the minimum
distance of any sample to the decision boundary. If this hyperplane is in the canonical form, the
margin can be measured by the length of the weight vector.

• Maximal margin classifier: A classifier in the family F that maximizes the margin. Maximizing the
margin is good according to intuition and PAC theory. Implies that only support vectors matter; other
training examples are ignorable.

• What if the training set is not linearly separable? Slack variables can be added to allow misclassification of difficult or noisy examples; the resulting margin is called a soft margin.

• A soft-margin allows a few variables to cross into the margin or over the hyperplane, allowing
misclassification.

• We penalize the crossover by looking at the number and distance of the misclassifications. This is a
trade off between the hyperplane violations and the margin size. The slack variables are bounded by
some set cost. The farther they are from the soft margin, the less influence they have on the
prediction.

• All observations have an associated slack variable:

1. If the slack variable = 0, the point is on the margin.

2. If the slack variable > 0, the point is inside the margin or on the wrong side of the hyperplane.

3. C is the trade-off between the slack variable penalty and the margin.

Decision Tree

• A decision tree is a simple representation for classifying examples. Decision tree learning is one of
the most successful techniques for supervised classification learning.

• In decision analysis, a decision tree can be used to visually and explicitly represent decisions and
decision making. As the name goes, it uses a tree-like model of decisions.

• Learned trees can also be represented as sets of if-then rules to improve human readability.

• A decision tree has two kinds of nodes

1. Each leaf node has a class label, determined by majority vote of training examples reaching that
leaf.

2. Each internal node is a question on features. It branches out according to the answers.

• Decision tree learning is a method for approximating discrete-valued target functions. The learned
function is represented by a decision tree.

• A learned decision tree can also be re-represented as a set of if-then rules. Decision tree learning is
one of the most widely used and practical methods for inductive inference.

• It is robust to noisy data and capable of learning disjunctive expressions.

• The decision tree learning method searches a completely expressive hypothesis space.

Decision Tree Representation

• Goal: Build a decision tree for classifying examples as positive or negative instances of a concept
• Supervised learning, batch processing of training examples, using a preference bias.

• A decision tree is a tree where

a. Each non-leaf node has associated with it an attribute (feature).

b. Each leaf node has associated with it a classification (+ or -).

c. Each arc has associated with it one of the possible values of the attribute at the node from which the arc is directed.

• Internal node denotes a test on an attribute. Branch represents an outcome of the test. Leaf nodes
represent class labels or class distribution.

• A decision tree is a flow-chart-like tree structure, where each node denotes a test on an attribute
value, each branch represents an outcome of the test, and tree leaves represent classes or class
distributions. Decision trees can easily be converted to classification rules.

Decision Tree Algorithm

• To generate decision tree from the training tuples of data partition D.

Input:

1. Data partition (D)

2. Attribute list

3. Attribute selection method

Algorithm:

1. Create a node (N)

2. If tuples in D are all of the same class then

3. Return node (N) as a leaf node labeled with the class C.

4. If attribute list is empty then return N as a leaf node labeled with the majority class in D

5. Apply attribute selection method (D, attribute list) to find the "best" splitting criterion;

6. Label node N with splitting criterion;

7. If splitting attribute is discrete-valued and multiway splits allowed

8. Then attribute list ← attribute list − splitting attribute (i.e., remove the splitting attribute);

9. For (each outcome j of splitting criterion)

10. Let Dj be the set of data tuples in D satisfying outcome j;

11. If Dj is empty then attach a leaf labeled with the majority class in D to node N;

12. Else attach the node returned by Generate decision tree (Dj, attribute list) to node N;

13. End of for loop

14. Return N;
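
• A compact Python sketch of this recursive procedure is shown below. The helper select_attribute (the "attribute selection method", e.g. information gain) and the (feature-dict, label) data layout are assumptions made only for illustration.

```python
from collections import Counter

def majority_class(rows):
    # rows is a list of (feature_dict, label) pairs
    return Counter(label for _, label in rows).most_common(1)[0][0]

def build_tree(rows, attributes, select_attribute):
    labels = {label for _, label in rows}
    if len(labels) == 1:                        # steps 2-3: all tuples of one class -> leaf
        return ("leaf", labels.pop())
    if not attributes:                          # step 4: attribute list empty -> majority leaf
        return ("leaf", majority_class(rows))
    best = select_attribute(rows, attributes)   # step 5: find the "best" splitting criterion
    children = {}
    for value in {f[best] for f, _ in rows}:    # step 9: one branch per outcome
        subset = [(f, l) for f, l in rows if f[best] == value]
        if not subset:                          # step 11: empty partition -> majority leaf
            children[value] = ("leaf", majority_class(rows))
        else:                                   # step 12: recurse on the partition
            children[value] = build_tree(
                subset, [a for a in attributes if a != best], select_attribute)
    return ("node", best, children)             # step 14
```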
• Decision tree generation consists of two phases: Tree construction and pruning

• In the tree construction phase, all training examples start at the root; the examples are then partitioned recursively based on selected attributes.

• In the tree pruning phase, branches that reflect noise or outliers are identified and removed.

• There are various paradigms that are used for learning binary classifiers which include:

1. Decision Trees

2. Neural Networks

3. Bayesian Classification

4. Support Vector Machines

Example 8.7.1 Using following feature tree, write decision rules for majority class.

Solution: Left Side: A feature tree combining two Boolean features. Each internal node or split is
labelled with a feature, and each edge emanating from a split is labelled with a feature value. Each
leaf therefore corresponds to a unique combination of feature values. Also indicated in each leaf is
the class distribution derived from the training set

• Right Side: A feature tree partitions the instance space into rectangular regions, one for each leaf.
• The leaves of the tree in the above figure could be labelled, from left to right, as ham - spam -
spam, employing a simple decision rule called majority class.

• Left side: A feature tree with training set class distribution in the leaves.

• Right side: A decision tree obtained using the majority class decision rule.

Appropriate Problem for Decision Tree Learning

• Decision tree learning is generally best suited to problems with the following characteristics:

1. Instances are represented by attribute-value pairs. Fixed set of attributes, and the attributes take a
small number of disjoint possible values.

2. The target function has discrete output values. Decision tree learning is appropriate for a boolean
classification, but it easily extends to learning functions with more than two possible output values.

3. Disjunctive descriptions may be required. Decision trees naturally represent disjunctive


expressions.

4. The training data may contain errors. Decision tree learning methods are robust to errors, both
errors in classifications of the training examples and errors in the attribute values that describe these
examples.

5. The training data may contain missing attribute values. Decision tree methods can be used even
when some training examples have unknown values.

6. Decision tree learning has been applied to practical problems such as classifying medical patients by their disease, equipment malfunctions by their cause, and loan applicants by their likelihood of default.
Advantages and Disadvantages of Decision Tree

Advantages:

1. Rules are simple and easy to understand.

2. Decision trees can handle both nominal and numerical attributes.

3. Decision trees are capable of handling datasets that may have errors.

4. Decision trees are capable of handling datasets that may have missing values.

5. Decision trees are considered to be a non-parametric method.

6. Decision trees are self-explanatory.

Disadvantages:

1. Most of the algorithms require that the target attribute has only discrete values.

2. Some problems, such as XOR, are difficult for decision trees to solve.

3. Decision trees are less appropriate for estimation tasks where the goal is to predict the value of a
continuous attribute.

4. Decision trees are prone to errors in classification problems with many classes and a relatively small number of training examples.

Random Forests

• Random forest is a popular machine learning algorithm that belongs to the supervised learning methods. It can be used for both classification and regression problems in ML. It is based on the concept of ensemble learning, which is a process of combining multiple classifiers to solve a complex problem and to improve the overall performance of the model.

• As the name indicates, "Random forest is a classifier that contains a number of decision trees on various subsets of the given dataset and takes the average to improve the predictive accuracy of that dataset." Instead of relying on one decision tree, the random forest takes the prediction from each tree and, based on the majority vote of predictions, it predicts the final output.

• A greater number of trees in the forest leads to higher accuracy and prevents the problem of overfitting.

How Does Random Forest Algorithm Work?

• Random forest works in two phases: the first is to create the random forest by combining N decision trees, and the second is to make predictions using the trees created in the first phase.

• The working procedure can be explained in the steps below (a code sketch of both phases follows the list):

Step 1: Select K random data points from the training set.

Step 2: Build the decision trees associated with the selected data points (subsets).

Step 3: Choose the number N of decision trees we want to build.

Step 4: Repeat steps 1 and 2.

Step 5: For new data points, find the predictions of each decision tree and assign the new data points to the category that wins the majority of votes.
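
• A hedged sketch of both phases is given below. It assumes numpy and scikit-learn's DecisionTreeClassifier as the base learner, and that class labels are integers 0..K-1; these are illustrative choices, not the only possible ones.

```python
# Random forest sketch: bootstrap sampling + tree building, then majority voting.
# Assumes numpy arrays X (n x d) and y (integer labels 0..K-1).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_forest_fit(X, y, n_trees=10, seed=0):
    rng = np.random.default_rng(seed)
    trees = []
    n = len(X)
    for _ in range(n_trees):                        # steps 3-4: build N trees
        idx = rng.integers(0, n, size=n)            # step 1: random (bootstrap) sample
        tree = DecisionTreeClassifier(max_features="sqrt")
        tree.fit(X[idx], y[idx])                    # step 2: tree on the chosen subset
        trees.append(tree)
    return trees

def random_forest_predict(trees, X):
    votes = np.array([t.predict(X) for t in trees])           # step 5: every tree votes
    return np.array([np.bincount(col.astype(int)).argmax()    # majority vote per point
                     for col in votes.T])

# Usage sketch:
#   trees = random_forest_fit(X_train, y_train, n_trees=25)
#   y_pred = random_forest_predict(trees, X_test)
```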

• The working of the algorithm can be better understood by the following example:

• Example: Suppose there is a dataset that contains multiple fruit images. This dataset is given to the random forest classifier. The dataset is divided into subsets and given to each decision tree. During the training phase, each decision tree produces a prediction result, and when a new data point occurs, then, based on the majority of results, the random forest classifier predicts the final decision.

Applications of Random Forest

There are mainly four sectors where random forest is commonly used:


1. Banking: The banking sector mainly uses this algorithm for the identification of loan risk.

2. Medicine: With the help of this algorithm, disease trends and risks of the disease can be identified.

3. Land use: Areas of similar land use can be identified using this algorithm.

4. Marketing: Marketing trends can be identified using this algorithm.

Advantages of Random Forest

• Random forest is capable of performing both classification and regression tasks.

• It is capable of handling large datasets with high dimensionality.

• It enhances the accuracy of the model and prevents the overfitting problem.

Disadvantages of Random Forest

• Although random forest can be used for both classification and regression tasks, it is not as well suited to regression tasks.

Two Marks Questions with Answers


Q.1 Define learning.

Ans.: Learning is a phenomenon and process which has manifestations of various aspects. Learning
process includes gaining of new symbolic knowledge and development of cognitive skills through
instruction and practice. It is also discovery of new facts and theories through observation and
experiment.

Q.2 Define machine learning.

Ans.: A computer program is said to learn from experience E with respect to some class of tasks T
and performance measure P, if its performance at tasks in T, as measured by P, improves with
experience E.

Q.3 What is an influence of information theory on machine learning?

Ans.: Information theory contributes measures of entropy and information content, minimum description length approaches to learning, and optimal codes and their relationship to optimal training sequences for encoding a hypothesis.

Q.4 What is meant by target function of a learning program?

Ans.: Target function is a method for solving a problem that an AI algorithm parses its training data
to find. Once an algorithm finds its target function, that function can be used to predict results. The
function can then be used to find output data related to inputs for real problems where, unlike
training sets, outputs are not included.

Q.5 Define useful perspective on machine learning.

Ans.: One useful perspective on machine learning is that it involves searching a very large space of
possible hypotheses to determine one that best fits the observed data and any prior knowledge held
by the learner.

Q.6 Describe the issues in machine learning?

Ans.: Issues of machine learning are as follows:

• What learning algorithms to be used?

• How much training data is sufficient?

• When and how prior knowledge can guide the learning process?

• What is the best strategy for choosing a next training experience?

• What is the best way to reduce the learning task to one or more function approximation problems?

• How can the learner automatically alter its representation to improve its learning ability?
Q.7 What is decision tree?

Ans.:

• Decision tree learning is a method for approximating discrete-valued target functions, in which the
learned function is represented by a decision tree.

• A decision tree is a tree where each node represents a feature (attribute), each link (branch) represents a decision (rule) and each leaf represents an outcome (categorical or continuous value).

• A decision tree or a classification tree is a tree in which each internal node is labeled with an input
feature. The arcs coming from a node labeled with a feature are labeled with each of the possible
values of the feature.

Q.8 What are the nodes of decision tree?

Ans.: A decision tree has two kinds of nodes

1. Each leaf node has a class label, determined by majority vote of training examples reaching that
leaf.

2. Each internal node is a question on features. It branches out according to the answers.

• Decision tree learning is a method for approximating discrete-valued target functions. The learned
function is represented by a decision tree

Q.9 Why tree pruning useful in decision tree induction?

Ans.: When a decision tree is built, many of the branches will reflect anomalies in the training data
due to noise or outliers. Tree pruning methods address this problem of overfitting the data. Such
methods typically use statistical measures to remove the least reliable branches.

Q.10 What is tree pruning?

Ans.: Tree pruning attempts to identify and remove such branches, with the goal of improving classification accuracy on unseen data.
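
• As an illustration, scikit-learn exposes cost-complexity pruning through the ccp_alpha parameter; the snippet below (one possible pruning strategy among several, using the Iris dataset purely as an example) shows that a pruned tree is smaller than an unpruned one.

```python
# Pruning illustration via cost-complexity pruning (assumes scikit-learn).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
unpruned = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X, y)
# The pruned tree has fewer nodes: the least reliable branches are removed.
print(unpruned.tree_.node_count, pruned.tree_.node_count)
```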

Q.11 What is RULE POST-PRUNING?

Ans.:

• It is a method for finding high-accuracy hypotheses.

• Rule post-pruning involves the following steps:

1. Infer decision tree from training set

2. Convert tree to rules - one rule per branch

3. Prune each rule by removing preconditions that result in improved estimated accuracy

4. Sort the pruned rules by their estimated accuracy and consider them in this sequence when
classifying unseen instances

Q.12 Why convert the decision tree to rules before pruning?

Ans.:
• Converting to rules allows distinguishing among the different contexts in which a decision node is
used.

• Converting to rules removes the distinction between attribute tests that occur near the root of the
tree and those that occur near the leaves.

• Converting to rules improves readability. Rules are often easier for people to understand.

Q.13 What do you mean by least square method?

Ans.: Least squares is a statistical method used to determine a line of best fit by minimizing the sum of squares created by a mathematical function. A "square" is determined by squaring the distance between a data point and the regression line or mean value of the data set.
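
• A minimal numpy sketch of a least-squares line fit is given below (the data values are illustrative only).

```python
# Least-squares fit of y = w0 + w1 * x (assumes numpy).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

A = np.column_stack([np.ones_like(x), x])        # design matrix [1, x]
w, residuals, _, _ = np.linalg.lstsq(A, y, rcond=None)
print(w)   # intercept w0 and slope w1 that minimize the sum of squared errors
```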

Q.14 What is linear discriminant function?

Ans.: A linear discriminant function is a linear combination of the input features used to separate the classes. LDA is a supervised learning algorithm, which means that it requires a labelled training set of data points in order to learn the linear discriminant function.

Q.15 What is a support vector in SVM?

Ans.: Support vectors are data points that are closer to the hyperplane and influence the position
and orientation of the hyperplane. Using these support vectors, we maximize the margin of the
classifier.

Q.16 What is support vector machines?

Ans.: A Support Vector Machine (SVM) is a supervised machine learning model that uses classification algorithms for two-group classification problems. After giving an SVM model sets of labeled training data for each category, it is able to categorize new examples.

Q.17 Define logistic regression.

Ans.: Logistic regression is a supervised learning technique. It is used for predicting the categorical dependent variable using a given set of independent variables.
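
• A small sketch, assuming scikit-learn and a tiny made-up dataset, of predicting a categorical dependent variable with logistic regression:

```python
# Logistic regression sketch (assumes scikit-learn; data is illustrative).
from sklearn.linear_model import LogisticRegression

X = [[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]]   # one independent variable
y = [0, 0, 0, 1, 1, 1]                           # categorical dependent variable

model = LogisticRegression().fit(X, y)
print(model.predict([[2.0]]))         # predicted class for a new observation
print(model.predict_proba([[2.0]]))   # class probabilities from the sigmoid
```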

Q.18 List out types of machine learning.

Ans.: Types of machine learning are supervised, semi-supervised, unsupervised and reinforcement learning.

Q.19 What is random forest?

Ans.: Random forest is an ensemble learning technique that combines multiple decision trees,
implementing the bagging method and results in a robust model with low variance.

Q.20 What are the five popular algorithms of machine learning?

Ans.: Popular algorithms are Decision Trees, Neural Networks (back propagation), Probabilistic
networks, Nearest Neighbor and Support vector machines.

Q.21 What is the function of 'Supervised Learning'?

Ans.: Functions of 'Supervised Learning' are classification, speech recognition, regression, time series prediction and string annotation.

Q.22 What are the advantages of Naive Bayes?


Ans.: The Naïve Bayes classifier converges more quickly than discriminative models like logistic regression, so less training data is needed. Its main disadvantage is that it cannot learn interactions between features.

Q.23 What is regression?

Ans.: Regression is a method to determine the statistical relationship between a dependent variable
and one or more independent variables.

Q.24 Explain linear and non-linear regression model.

Ans.: In linear regression models, the dependence of the response on the regressors is defined by a
linear function, which makes their statistical analysis mathematically tractable. On the other hand, in
nonlinear regression models, this dependence is defined by a nonlinear function, hence the
mathematical difficulty in their analysis.

Q.25 What is regression analysis used for?

Ans.: Regression analysis is a form of predictive modelling technique which investigates the relationship between a dependent (target) variable and one or more independent (predictor) variables. This technique is
used for forecasting, time series modelling and finding the causal effect relationship between the
variables.

Q.26 List two properties of logistic regression.

Ans.:

1. The dependent variable in logistic regression follows the Bernoulli distribution.

2. Estimation is done through maximum likelihood.

Q.27 What is the goal of logistic regression?

Ans.: The goal of logistic regression is to correctly predict the category of outcome for individual
cases using the most parsimonious model. To accomplish this goal, a model is created that includes
all predictor variables that are useful in predicting the response variable.

Q.28 Define supervised learning.

Ans.: Supervised learning is learning in which the network is trained by providing it with inputs and matching output patterns. These input-output pairs are usually provided by an external teacher.
