CP1104 Machine Learning Techniques Department of CSE 2022-23

CP1104
MACHINE LEARNING TECHNIQUES

OBJECTIVES
❖ To introduce students to the basic concepts and techniques of Machine Learning.
❖ To have a thorough understanding of the Supervised and Unsupervised learning techniques
❖ To study the various probability-based learning techniques
❖ To understand graphical models of machine learning algorithms
UNIT I
INTRODUCTION
9
Learning – Types of Machine Learning – Supervised Learning – The Brain and the Neuron –
Design a Learning System – Perspectives and Issues in Machine Learning – Concept Learning
Task – Concept Learning as Search – Finding a Maximally Specific Hypothesis – Version
Spaces and the Candidate Elimination Algorithm – Linear Discriminants – Perceptron – Linear
Separability – Linear Regression.

UNIT II
LINEAR MODELS

Multi-layer Perceptron – Going Forwards – Going Backwards: Back-Propagation of Error – Multi-
layer Perceptron in Practice – Examples of using the MLP – Overview – Deriving Back-
Propagation – Radial Basis Functions and Splines – Concepts – RBF Network – Curse of
Dimensionality – Interpolations and Basis Functions – Support Vector Machines.

UNIT III
TREE AND PROBABILISTIC MODELS

Learning with Trees – Decision Trees – Constructing Decision Trees – Classification and
Regression Trees – Ensemble Learning – Boosting – Bagging – Different ways to Combine
Classifiers – Probability and Learning – Data into Probabilities – Basic Statistics – Gaussian
Mixture Models – Nearest Neighbor Methods – Unsupervised Learning – K means Algorithms –
Vector Quantization – Self Organizing Feature Map

UNIT IV
DIMENSIONALITY REDUCTION AND EVOLUTIONARY MODELS

Dimensionality Reduction – Linear Discriminant Analysis – Principal Component Analysis –


Factor Analysis – Independent Component Analysis – Locally Linear Embedding – Isomap –
Least Squares Optimization – Evolutionary Learning – Genetic Algorithms – Genetic Offspring:
Genetic Operators – Using Genetic Algorithms – Reinforcement Learning – Overview – Getting
Lost Example – Markov Decision Process

UNIT V
GRAPHICAL MODELS

Markov Chain Monte Carlo Methods – Sampling – Proposal Distribution – Markov Chain Monte
Carlo – Graphical Models – Bayesian Networks – Markov Random Fields – Hidden Markov
Models – Tracking Methods

TOTAL : 45 PERIODS
TEXT BOOKS
1. Ethem Alpaydin, "Introduction to Machine Learning 3e (Adaptive Computation and Machine Learning Series)", Third Edition, MIT Press, 2014.
2. Jason Bell, "Machine Learning – Hands on for Developers and Technical Professionals", First Edition, Wiley, 2014.

REFERENCE BOOKS
1. Peter Flach, "Machine Learning: The Art and Science of Algorithms that Make Sense of Data", First Edition, Cambridge University Press, 2012.
2. Stephen Marsland, "Machine Learning – An Algorithmic Perspective", Second Edition, Chapman and Hall/CRC Machine Learning and Pattern Recognition Series, 2014.
3. Tom M Mitchell, "Machine Learning", First Edition, McGraw Hill Education, 2013.

UNIT-I INTRODUCTION
PART-A
1. Give precise definition of learning.
A computer program is said to learn from experience E with respect to some class of tasks
T and performance measure P, if its performance at tasks in T, as measured by P, improves with
experience E.

2. Find T, P and E for checkers learning problem, handwriting recognition learning problem,
robot driving learning problem
Checkers learning problem:
• Task T: playing checkers
• Performance measure P: percent of games won against opponents
• Training experience E: playing practice games against itself.
Handwriting recognition learning problem:
• Task T: recognizing and classifying handwritten words within images
• Performance measure P: percent of words correctly classified
• Training experience E: a database of handwritten words with given classifications
Robot driving learning problem:
• Task T: driving on public four-lane highways using vision sensors
• Performance measure P: average distance travelled before an error
• Training experience E: a sequence of images and steering commands recorded while
observing a human driver

3. Define machine learning.


Machine learning is a subfield of computer science (CS) and artificial intelligence (AI) that
deals with the construction and study of systems that can learn from data, rather than follow only
explicitly programmed instructions. (i.e. Machine learning is the science of getting computers to
act without being explicitly programmed)

4. List some disciplines and examples of their influence on machine learning.


• Artificial intelligence
• Bayesian methods
• Computational complexity theory
• Control theory
• Information theory
• Philosophy
• Psychology and neurobiology
• Statistics

5. List some successful applications of machine learning.


• Learning to recognize spoken words
• Learning to drive an autonomous vehicle
• Learning to classify new astronomical structures
• Learning to play world-class backgammon.

6. What is the difference between artificial intelligence and machine learning methods?

Artificial intelligence is sometimes known as symbolic processing because the computer manipulates symbols that reflect the environment. In contrast, machine learning methods are sometimes called sub-symbolic, since no symbols or symbolic manipulation are involved.

7. List the types of machine learning. (Jan 2018)


• Supervised learning
• Unsupervised learning
• Reinforcement learning
• Evolutionary learning

8. What is supervised learning?


A training set of examples with the correct responses (targets) is provided and, based on this training set, the algorithm generalizes to respond correctly to all possible inputs. This is also called learning from exemplars.

9. What is unsupervised learning?


Correct responses are not provided, instead the algorithm tries to identify similarities
between the inputs so that inputs that have something in common are categorized together.
The statistical approach to unsupervised learning is known as density estimation.

10. What is reinforcement learning?


This is somewhere between supervised and unsupervised learning. The algorithm gets told when the answer is wrong, but does not get told how to correct it. It has to explore and try out different possibilities until it works out how to get the answer right. Reinforcement learning is sometimes called learning with a critic because of this monitor that scores the answer but does not suggest improvements.

11. What is evolutionary learning? (Jan 2018)


Biological evolution can be seen as a learning process: biological organisms adapt to improve their survival rates and chances of having offspring in their environment.

12. What are the basic functions of a neuron?


• Receive signals (or information).
• Integrate incoming signals (to determine whether or not the information should be passed
along).
• Communicate signals to target cells (other neurons or muscles or glands).
These neuronal functions are reflected in the anatomy of the neuron.

13. Draw the basic structure of neuron

14. List the steps in designing a learning system.


• Choosing the training experience
• Choosing the target function
• Choosing the representation for the target function
• Choosing a function approximation algorithm.
Estimating training values
Adjusting the weights
• The final design

15. State Hebb’s rule.


Hebb’s rule says that the changes in the strength of synaptic connections are proportional to the correlation in the firing of the two connecting neurons. So if two neurons consistently fire simultaneously, then any connection between them will change in strength, becoming stronger. However, if the two neurons never fire simultaneously, the connection between them will die away.

16. How McCulloch and Pitts modelled a neuron?


McCulloch and Pitts modelled a neuron as:
• A set of weighted inputs wi which correspond to the synapses
• An adder that sums the input signals (equivalent to the membrane of the cell that collects
electrical charge)
• An activation function (initially a threshold function) that decides whether the neuron fires
for the current inputs.

17. List the limitations of MCP model.


• Weights and thresholds are analytically determined, so the model cannot learn.
• It is very difficult to minimize the size of a network.
• It cannot handle non-discrete and/or non-binary tasks.

18. What is Perceptron? (Jan 2018)


The Perceptron is nothing more than a collection of McCulloch and Pitts neurons together
with a set of inputs and some weights to fasten the inputs to the neurons.

19. What is Linear Discriminant?


Linear discriminant analysis (LDA) is a generalization of Fisher's linear discriminant, a
method used in statistics, pattern recognition and machine learning to find a linear combination of
features that characterizes or separates two or more classes of objects or events. The resulting
combination may be used as a linear classifier, or, more commonly, for dimensionality reduction
before later classification.

20. What is linear separability?


Consider two-input patterns (X1, X2) being classified into two classes. Each point, marked with either X or O, represents a pattern with a set of values (X1, X2), and each pattern is classified into one of the two classes. Notice that these classes can be separated with a single line L; they are known as linearly separable patterns. Linear separability refers to the fact that classes of patterns with an n-dimensional vector x = (x1, x2, ..., xn) can be separated with a single decision surface. In this case, the line L represents the decision surface.

21. What is linear regression?


In statistics, linear regression is a linear approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables (or independent variables) denoted X. The case of one explanatory variable is called simple linear regression. For more than one explanatory variable, the process is called multiple linear regression.

22. State the assumptions in a linear regression model.


• Linearity assumption: it is assumed that there is a linear relationship between the dependent and independent variables.
• Normality assumption: it is assumed that the error terms ε(i) are normally distributed.
• Zero-mean assumption: it is assumed that the residuals have a mean value of zero.

23. What is feature engineering? How do you apply it in the process of modelling?
Feature engineering is the process of transforming raw data into features that better represent the
underlying problem to the predictive models, resulting in improved model accuracy on unseen
data.

24. What value is the sum of the residuals of a linear regression close to? Justify.
The sum of the residuals of a linear regression is 0. Linear regression works on the assumption that the errors (residuals) are normally distributed with a mean of 0, i.e.
Y = β^T X + ε
Here, Y is the target or dependent variable,
β is the vector of regression coefficients,
X is the feature matrix containing all the features as columns,
ε is the residual term such that ε ~ N(0, σ²).
The sum of all the residuals is the expected value of the residuals times the total number of data points; since the expectation of the residuals is 0, the sum of all the residual terms is zero. More precisely, when the model includes an intercept, the least-squares normal equations force the fitted residuals to sum to exactly zero.
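
As a quick numerical check of this property (an illustrative sketch, assuming NumPy is available; the data are synthetic), an ordinary least-squares fit with an intercept gives residuals that sum to zero up to floating-point error:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.5 * x + 1.0 + rng.normal(0, 1, size=100)   # synthetic data with noise

# Design matrix with an intercept column, fitted by ordinary least squares.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

residuals = y - X @ beta
print(residuals.sum())   # ~0 (up to floating-point error)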

25. What is concept learning?


Concept learning: acquiring the definition of a general category from given sample positive and negative training examples of the category. Concept learning can be seen as a problem of searching through a predefined space of potential hypotheses for the hypothesis that best fits the training examples.

PART-B

1. Define learning. What are the three features of learning? Explain the three features of learning
with the following problem.
a) Checkers learning problem
b) Handwriting recognition learning problem
c) Robot driving learning problem.
2 a) List and explain some successful applications of machine learning?
b) List some disciplines and examples of their influence on machine learning.
3. Explain the various steps in designing a learning system.
4. Elaborate the perspectives and issues in machine learning.
5. Explain concept learning task with the example of ENJOYSPORT.
6. Explain concept learning as search with the example of ENJOYSPORT.
7. Write FIND-S algorithm. Trace the algorithm with example.
8. Explain version space and List-Then-Eliminate algorithm with an example.
9. Explain CANDIDATE ELIMINATION learning algorithm with an example.
10. List and explain the various remarks on version spaces and candidate elimination.
11. Explain supervised learning with the concept of regression and classification.
12. Draw the neuron structure and explain the various parts.
13. Explain McCulloch-Pitts model with an example. Explain the limitations of MCP model.
14. Explain Perceptron learning algorithm with an example.

UNIT-II LINEAR MODELS

1. What is multilayer perceptron (MLP)?


A multilayer perceptron (MLP) is a class of feed forward Artificial Neural Network. The
MLP consists of three or more layers (an input and an output layer with one or more hidden layers)
of nonlinearly-activating nodes making it a deep neural network. Since MLPs are fully connected,
each node in one layer connects with a certain weight wij to every node in the following layer.

2. What are the 2 steps of MLP training?


Training the MLP consists of two parts:
• Working out what the outputs are for the given inputs and the current weights
• Updating the weights according to the error, which is a function of the difference between
the outputs and the targets.

3. What is biases in MLP?


Just like in the perceptron case, we need to include a bias input to each neuron. We do this in the same way, by having an extra input that is permanently set to −1 and adjusting the weights to each neuron as part of the training. Thus each neuron in the network has one extra input with a fixed value.

4. What do you mean by going forwards and backwards through the network?
Training the MLP consists of two parts:
• Working out what the outputs are for the given inputs and the current weights
• Updating the weights according to the error, which is a function of the difference between
the outputs and the targets. These are generally known as going forwards and backwards
through the network.
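
A minimal sketch of one forward and one backward pass for a single-hidden-layer MLP with sigmoid activations (assuming NumPy is available; the data, layer sizes and learning rate are made up, and bias inputs are omitted for brevity):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 3))          # 4 examples, 3 input features
t = rng.integers(0, 2, size=(4, 1))  # binary targets
eta = 0.1                            # learning rate

W1 = rng.normal(scale=0.1, size=(3, 5))   # input -> hidden weights
W2 = rng.normal(scale=0.1, size=(5, 1))   # hidden -> output weights

# Going forwards: compute hidden and output activations for the current weights.
h = sigmoid(X @ W1)
y = sigmoid(h @ W2)

# Going backwards: propagate the output error back and update the weights.
delta_out = (y - t) * y * (1 - y)                 # squared-error derivative times sigmoid'
delta_hid = (delta_out @ W2.T) * h * (1 - h)
W2 -= eta * h.T @ delta_out
W1 -= eta * X.T @ delta_hid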

5. What do you mean by back propagation?

Back propagation is a method used in artificial neural networks to calculate the error
contribution of each neuron after a batch of data (e.g. in image recognition, multiple images) is
processed. This is used by an enveloping optimization algorithm to adjust the weight of each
neuron, completing the learning process for that case.

6. What is backward propagation of errors?


Technically it calculates the gradient of the loss function. It is commonly used in the
gradient descent optimization algorithm. It is also called backward propagation of errors, because
the error is calculated at the output and distributed back through the network layers.

7. What is activation function?


The activation function of a node defines the output of that node given an input or set of
inputs. A standard computer chip circuit can be seen as a digital network of activation functions
that can be "ON" (1) or "OFF" (0), depending on input. This is similar to the behavior of the linear
perceptron in neural networks. However, only nonlinear activation functions allow such networks
to compute nontrivial problems using only a small number of nodes. In artificial neural networks
this function is also called the transfer function.

8. Why can’t we do it without activating the input signal?


If we do not apply an activation function, then the output signal would simply be a linear function. A linear function is just a polynomial of degree one. A linear equation is easy to solve, but it is limited in its complexity and has less power to learn complex functional mappings from data. A neural network without an activation function would simply be a linear regression model, which has limited power and does not perform well most of the time.

9. Why do we need Non-Linearities?


Non-linear functions are those which have degree more than one and which have a curvature when plotted. We need a neural network model to learn and represent almost anything, including arbitrarily complex functions which map inputs to outputs. Neural networks are considered universal function approximators, meaning that they can compute and learn essentially any function; almost any process we can think of can be represented as a functional computation in a neural network.
Hence it all comes down to this: we need to apply an activation function f(x) to make the network more powerful and give it the ability to learn something complex and complicated from data, and to represent non-linear, arbitrary functional mappings between inputs and outputs. By using a non-linear activation we are able to generate non-linear mappings from inputs to outputs.

10. List the various activation function.


• Step function
• Linear function
• Sigmoid function
• Tanh function

11. Write about Sigmoid Activation function.


It is an activation function of the form f(x) = 1 / (1 + exp(−x)). Its range is between 0 and 1. It is an S-shaped curve. It is easy to understand.

12. What are the limitations of sigmoid function?


• Vanishing gradient problem.
• Its output isn’t zero-centred, which makes the gradient updates go too far in different directions (0 < output < 1) and makes optimization harder.
• Sigmoids saturate and kill gradients.
• Sigmoids have slow convergence.

13. What is Tanh function?


Hyperbolic tangent function (Tanh): its mathematical formula is f(x) = (1 − exp(−2x)) / (1 + exp(−2x)). Its output is zero-centred because its range is between −1 and 1, i.e. −1 < output < 1. Hence optimization is easier, and in practice it is usually preferred over the sigmoid function, but it still suffers from the vanishing gradient problem.
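
A short sketch of both activation functions (assuming NumPy is available) to make the ranges concrete:

import numpy as np

def sigmoid(x):
    # Range (0, 1), not zero-centred.
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Range (-1, 1), zero-centred; equivalent to np.tanh(x).
    return (1.0 - np.exp(-2 * x)) / (1.0 + np.exp(-2 * x))

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))   # approx [0.12, 0.5, 0.88]
print(tanh(z))      # approx [-0.96, 0.0, 0.96]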

14. What are the two ways to implement BP learning?


• Sequential mode
• Batch mode

15. What is sequential mode?


In this mode of BP learning, adjustments are made to the free parameters of the network on an example-by-example basis. The sequential mode is well suited for pattern classification problems. Sequential mode is also referred to as on-line mode or stochastic mode.

16. What is batch mode?


In this mode of BP learning, adjustments are made to the free parameters of the network
on an epoch-by-epoch basis, where each epoch consists of the entire set of training examples. This
batch mode is best suited for nonlinear regression.

17. How many weights are there in MLP with one hidden layer?
For an MLP with one hidden layer there are (m+1) × n + (n+1) × p weights, where m, n and p are the numbers of nodes in the input, hidden and output layers respectively. The extra +1s come from the bias nodes, which also have adjustable weights.

18. What are the various parameters that the hidden unit depends on?
The best number of hidden units depends in a complex way on many factors, including:
• The number of training patterns
• The numbers of input and output units
• The amount of noise in the training data
• The complexity of the function or classification to be learned
• The type of hidden unit activation function
• The training algorithm

19. What is a radial basis function network?


A Radial basis function network is an artificial neural network that uses radial basis
functions as activation functions. The output of the network is a linear combination of radial basis
functions of the inputs and neuron parameters. Radial basis function networks have many uses,
including function approximation, time series prediction, classification, and system control.

20. Draw the architecture of a radial basis function network.

An input vector x is used as input to all radial basis functions, each with different
parameters. The output of the network is a linear combination of the outputs from radial basis
functions.

21. What is SVM?


In machine learning, support vector machines (SVMs, also support vector networks) are
supervised learning models with associated learning algorithms that analyze data used for
classification and regression analysis. Given a set of training examples, each marked as belonging
to one or the other of two categories, an SVM training algorithm builds a model that assigns new
examples to one category or the other, making it a non-probabilistic binary linear classifier.

22. What is a spline? (Jan 2018)


Splines are functions which match given values at the points x1, ..., xN and have continuous derivatives up to some order at the knots, the interior points x2, ..., xN−1. Cubic splines are most common. In this case the function is represented by a cubic polynomial within each interval and has continuous first and second derivatives at the knots.

23. State the applications of radial basis function network. (Jan 2018)
A radial basis function network is an artificial neural network that uses radial basis functions as
activation functions. The output of the network is a linear combination of radial basis functions
of the inputs and neuron parameters. Radial basis function networks have many uses, including
function approximation, time series prediction, classification, and system control.
24. Explain the bias-variance trade-off.
Bias refers to the difference between the values predicted by the model and the real values; it is an error, and one of the goals of an ML algorithm is to have a low bias. Variance refers to the sensitivity of the model to small fluctuations in the training dataset; another goal of an ML algorithm is to have low variance. For a dataset that is not exactly linear, it is not possible to have both bias and variance low at the same time. A straight-line model will have low variance but high bias, whereas a high-degree polynomial will have low bias but high variance. So there is a trade-off between the two; the ML specialist has to decide, based on the assigned problem, how much bias and variance can be tolerated. Based on this, the final model is built.
25. What’s the “kernel trick” and how is it useful?

The kernel trick plays a huge role in applying SVMs to non-linearly separable classification problems. The idea is to map the non-linearly separable dataset into a higher-dimensional space where we can find a hyperplane that can separate the samples.

PART-B
1. Explain how the XOR problem can be solved by an MLP?
2. Explain the multilayer perceptron algorithm in detail.
3. Explain the following with respect to MLP.
a) Initializing the weights
b) Different output activation functions
4. Explain the following in detail
a) Sequential and batch training
b) Local minima
5. What are the choices that can be made about the network in order to use it for real problems?
Explain in detail.
6. How regression problem can be solved using MLP? Explain with example.
7. Explain classification with MLP in detail.
8. Explain time series prediction problem with MLP in detail.
9. Explain the auto associative network in detail.
10. Explain Radial Basis Function (RBF) network & training RBFN in detail.
11. Explain the following in detail
a) The curse of dimensionality
b) Interpolation and basis function.
12. Explain support vector machine in detail.

UNIT-III TREE AND PROBABILISTIC MODEL


PART-A

1. What is the idea of decision tree?


The idea of a decision tree is that we break classification down into a set of choices about each feature in turn, starting at the root (base) of the tree and progressing down to the leaves, where we receive the classification decision.

2.List the advantages and disadvantages of decision tree.


• Are simple to understand and interpret. People are able to understand decision tree models
after a brief explanation.
• Have value even with little hard data. Important insights can be generated based on experts
describing a situation (its alternatives, probabilities, and costs) and their preferences for
outcomes.
• Allow the addition of new possible scenarios.
• Help determine worst, best and expected values for different scenarios.
• Use a white box model: the reasoning behind a given result can be inspected and explained.
• Can be combined with other decision techniques.
Disadvantages of decision trees:
• For data including categorical variables with different numbers of levels, information gain in decision trees is biased in favor of those attributes with more levels.

• Calculations can get very complex, particularly if many values are uncertain and/or if many
outcomes are linked.

3. Define (information) Entropy


Information entropy describes the amount of impurity in a set of features. The entropy H of a set of probabilities pi is
H(p) = − Σ pi log2(pi),
where the logarithm is base 2.

4. Write function for computing the entropy.


from math import log2

def calc_entropy(p):
    # Entropy contribution of a single probability p; by convention 0 * log2(0) = 0.
    if p != 0:
        return -p * log2(p)
    else:
        return 0

5. Define information gain.


Information gain is defined as the entropy of the whole set minus the entropy when a
particular feature is chosen. This is defined by
Gain(S, F) = Entropy(S) − Σ_{f ∈ Values(F)} (|Sf| / |S|) · Entropy(Sf)
where S is the set of examples, F is a possible feature out of the set of all possible ones, and |Sf| is a count of the number of members of S that have value f for feature F.
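
A small self-contained Python sketch of this formula; the toy examples, labels and the "outlook" feature are made up for illustration:

from math import log2
from collections import Counter

def entropy(labels):
    # H = -sum(p_i * log2(p_i)) over the class proportions in `labels`.
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(examples, labels, feature):
    # Gain(S, F) = Entropy(S) - sum_f (|S_f| / |S|) * Entropy(S_f)
    n = len(labels)
    gain = entropy(labels)
    for value in {ex[feature] for ex in examples}:
        subset = [lab for ex, lab in zip(examples, labels) if ex[feature] == value]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

examples = [{"outlook": "sunny"}, {"outlook": "sunny"}, {"outlook": "rain"}, {"outlook": "rain"}]
labels = ["no", "no", "yes", "no"]
print(information_gain(examples, labels, "outlook"))   # approx 0.31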

6. List some decision tree algorithms.


There are many specific decision-tree algorithms. Notable ones include:
• ID3 (Iterative Dichotomiser 3)
• C4.5 (successor of ID3)
• CART (Classification And Regression Tree)
• CHAID (CHi-squared Automatic Interaction Detector). Performs multi-level splits when
computing classification trees.
• MARS: extends decision trees to handle numerical data better.
• Conditional Inference Trees. Statistics-based approach that uses non-parametric tests as
splitting criteria, corrected for multiple testing to avoid overfitting. This approach results
in unbiased predictor selection and does not require pruning

7. Write note on ID3 algorithm.

• Calculate the entropy of every attribute using the data set


• Split the set S into subsets using the attribute for which the resulting entropy
(after splitting) is minimum (or, equivalently, information gain is maximum)
• Make a decision tree node containing that attribute
• Recurse on subsets using remaining attributes.

8. What is CART?
Decision trees used in data mining are of two main types:
• Classification tree analysis is when the predicted outcome is the class to which the data
belongs.

• Regression tree analysis is when the predicted outcome can be considered a real number
(e.g. the price of a house, or a patient's length of stay in a hospital).
The term Classification And Regression Tree (CART) analysis is an umbrella term used to refer
to both of the above procedures, first introduced by researcher Breiman. Trees used for regression
and trees used for classification have some similarities - but also some differences, such as the
procedure used to determine where to split.

9. What is Gini impurity?


Used by the CART (classification and regression tree) algorithm, Gini impurity is a measure of how often a randomly chosen element from the set would be incorrectly labeled if it were randomly labeled according to the distribution of labels in the subset. Gini impurity can be computed by summing the probability pi of an item with label i being chosen times the probability 1 − pi of a mistake in categorizing that item. It reaches its minimum (zero) when all cases in the node fall into a single target category.
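
A minimal Python sketch of this measure; the label lists are illustrative:

from collections import Counter

def gini_impurity(labels):
    # Sum over labels of p_i * (1 - p_i), equivalently 1 - sum(p_i ** 2).
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

print(gini_impurity(["a", "a", "a", "a"]))   # 0.0 -> pure node
print(gini_impurity(["a", "a", "b", "b"]))   # 0.5 -> maximally mixed (two classes)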

10. What is the goal of ensemble methods?


The goal of ensemble methods is to combine the predictions of several base estimators built
with a given learning algorithm in order to improve generalizability / robustness over a single
estimator.

11. What are the two families of ensemble methods?


Two families of ensemble methods are usually distinguished:
• In averaging methods, the driving principle is to build several estimators independently and then to average their predictions. On average, the combined estimator is usually better than any of the single base estimators because its variance is reduced.
Examples: bagging methods, forests of randomized trees.
• By contrast, in boosting methods, base estimators are built sequentially and one tries to reduce the bias of the combined estimator. The motivation is to combine several weak models to produce a powerful ensemble.
Examples: AdaBoost, Gradient Tree Boosting.
12. What is Boosting?
Boosting is a general ensemble method that attempts to create a strong classifier from a
number of weak classifiers.
This is done by building a model from the training data, then creating a second model that
attempts to correct the errors from the first model. Models are added until the training set is
predicted perfectly or a maximum number of models are added.

13. What is stumping?


There is a very extreme form of boosting that is applied to trees. It goes by the descriptive name of stumping. Stumping consists of simply taking the root of the tree and using it as the decision maker.

14. What is bagging?


Bootstrap aggregating, also called bagging, is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression. It also reduces variance and helps to avoid overfitting. Although it is usually applied to decision tree methods, it can be used with any type of method. Bagging is a special case of the model averaging approach.

15. What is the expectation–maximization (EM) algorithm?


In statistics, an expectation–maximization (EM) algorithm is an iterative method to find
maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models,
where the model depends on unobserved latent variables. The EM iteration alternates between
performing an expectation (E) step, which creates a function for the expectation of the log-
likelihood evaluated using the current estimate for the parameters, and a maximization (M) step,
which computes parameters maximizing the expected log-likelihood found on the E step. These
parameter-estimates are then used to determine the distribution of the latent variables in the next
E step.

16. What is Gaussian mixture model?


A Gaussian mixture model is a probabilistic model that assumes all the data points are
generated from a mixture of a finite number of Gaussian distributions with unknown parameters.
One can think of mixture models as generalizing k-means clustering to incorporate information
about the covariance structure of the data as well as the centers of the latent Gaussians.

17. What is kernel smoother?


A kernel smoother is a statistical technique for estimating a real-valued function by using its noisy observations, when no parametric model for this function is known. The estimated function is smooth, and the level of smoothness is set by a single parameter.
This technique is most appropriate for low-dimensional (p < 3) data visualization purposes. Actually, the kernel smoother represents the set of irregular data points as a smooth line or surface.

18. What is k-NN algorithm?


In pattern recognition, the k-nearest neighbors algorithm (k-NN) is a non-parametric
method used for classification and regression. In both cases, the input consists of the k closest
training examples in the feature space. The output depends on whether k-NN is used for
classification or regression:
In k-NN classification, the output is a class membership. An object is classified by a
majority vote of its neighbors, with the object being assigned to the class most common among its
k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply
assigned to the class of that single nearest neighbor.
In k-NN regression, the output is the property value for the object. This value is the average
of the values of its k nearest neighbors.
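
A compact k-NN classification sketch (assuming NumPy is available); the toy training points are made up:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    # Classify x by a majority vote among its k nearest training points (Euclidean distance).
    distances = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(distances)[:k]
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = ["O", "O", "X", "X"]
print(knn_predict(X_train, y_train, np.array([0.2, 0.1])))   # "O"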

19. What is vector quantization?


Vector quantization (VQ) is a classical quantization technique from signal processing that
allows the modeling of probability density functions by the distribution of prototype vectors. It
was originally used for data compression. It works by dividing a large set of points (vectors) into
groups having approximately the same number of points closest to them. Each group is represented
by its centroid point, as in k-means and some other clustering algorithms.

20. What is SOFM?


A self-organizing map (SOM) or self-organizing feature map (SOFM) is a type of artificial
neural network (ANN) that is trained using unsupervised learning to produce a low-dimensional
(typically two-dimensional), discretized representation of the input space of the training samples,
called a map, and is therefore a method to do dimensionality reduction. Self-organizing maps differ

from other artificial neural networks as they apply competitive learning as opposed to error-
correction learning (such as back propagation with gradient descent), and in the sense that they use
a neighborhood function to preserve the topological properties of the input space.

21. What is ensemble learning? (Jan 2018)


Ensemble learning is the process by which multiple models, such as classifiers or experts, are
strategically generated and combined to solve a particular computational intelligence problem.
Ensemble learning is primarily used to improve the (classification, prediction, function
approximation, etc.) performance of a model, or reduce the likelihood of an unfortunate selection
of a poor one.
22. Distinguish between classification and regression. (Jan 2018)
Classification predictive modeling problems are different from regression predictive modeling
problems.
• Classification is the task of predicting a discrete class label.
• Regression is the task of predicting a continuous quantity.

23. Explain the steps in making a decision tree.


a. Take the entire data set as input
b. Calculate entropy of the target variable, as well as the predictor attributes
c. Calculate your information gain of all attributes (we gain information on sorting different
objects from each other)
d. Choose the attribute with the highest information gain as the root node
e. Repeat the same procedure on every branch until the decision node of each branch is
finalized
24. How do you build a random forest model?
A random forest is built up of a number of decision trees. If you split the data into different
packages and make a decision tree in each of the different groups of data, the random forest brings
all those trees together. Steps to build a random forest model:
a. Randomly select 'k' features from a total of 'm' features where k << m
b. Among the 'k' features, calculate the node D using the best split point
c. Split the node into daughter nodes using the best split
d. Repeat steps two and three until leaf nodes are finalized
e. Build forest by repeating steps one to four for 'n' times to create 'n' number of trees

25. How can you avoid the overfitting in your model?


Overfitting refers to a model that fits a very small amount of data too closely and ignores the bigger picture. There are three main methods to avoid overfitting:
a. Keep the model simple—take fewer variables into account, thereby removing some of the
noise in the training data
b. Use cross-validation techniques, such as k-fold cross-validation
c. Use regularization techniques, such as LASSO, that penalize certain model parameters if
they're likely to cause overfitting

PART-B
1. Explain ID3 algorithm with example.
2. Explain classification and regression trees with example.

3. Make a decision tree that computes the logical AND function. How does it compare to the
perceptron solution?
4. Explain AdaBoost algorithm with example.
5. Explain the different ways to combine classifiers.
6. Explain the following
a) Turning data into probabilities
b) Naïve Bayes classifier
7. Explain the important statistical concepts in detail.
8. Explain the Gaussian Mixture models in detail.
9. Explain nearest neighbor methods in detail.
10. Explain the k-means algorithm in detail
11. Explain vector quantization in detail.
12. Explain SOFM in detail.

UNIT-IV DIMENSIONALITY REDUCTION


PART-A

1. Define dimensionality reduction (Jan 2018)


In machine learning and statistics, dimensionality reduction or dimension reduction is the
process of reducing the number of random variables under consideration via obtaining a set of
principal variables. It can be divided into feature selection and feature extraction.

2. List the different ways to do dimensionality reduction


• Feature selection
• Feature derivation
• Clustering

3. Define feature selection.


Feature selection means looking through the features that are available and seeing whether or not they are actually useful, i.e. correlated to the output variables.

4. Define feature derivation


Feature derivation means deriving new features from the old ones, generally by applying transforms to the dataset that simply change the axes (coordinate system) of the graph by moving and rotating them, which can be written simply as a matrix that we apply to the data.

5. What is Linear Discriminant Analysis (LDA)?


Linear Discriminant Analysis (LDA) is most commonly used as a dimensionality reduction technique in the pre-processing step for pattern-classification and machine learning applications. The goal is to project a dataset onto a lower-dimensional space with good class-separability in order to avoid overfitting (the “curse of dimensionality”) and also to reduce computational costs.

6. What is the idea of principal component analysis?


The idea of principal component analysis is to find the direction in the data with the largest variation. The algorithm first centres the data by subtracting off the mean, then chooses the direction with the largest variation and places an axis in that direction, and then looks at the
variation that remains and finds another axis that is orthogonal to the first and covers as much of the remaining variation as possible. It then iterates this until it has run out of possible axes.
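
A minimal PCA sketch along these lines (assuming NumPy is available): centre the data, take the eigenvectors of the covariance matrix, and project onto the leading directions:

import numpy as np

def pca(X, n_components=2):
    # Centre the data, then find the orthogonal directions of largest variation.
    X_centred = X - X.mean(axis=0)
    cov = np.cov(X_centred, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)            # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]  # keep the largest-variance directions
    components = eigvecs[:, order]
    return X_centred @ components                     # data projected onto the principal axes

X = np.random.default_rng(0).normal(size=(100, 5))
print(pca(X, n_components=2).shape)   # (100, 2)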

7. Define spectral decomposition.


When A = S Λ S^(-1) is a real-symmetric (or Hermitian) matrix, its eigenvectors can be chosen orthonormal and hence S = Q is orthogonal (or unitary). Thus A = Q Λ Q^T, which is called the spectral decomposition of A.

8. Find the spectral decomposition for A = (3 2; 2 3)


The characteristic equation for A is λ² − 6λ + 5 = 0, so the eigenvalues of A are λ1 = 1 and λ2 = 5. For λ1 = 1 the eigenvector is v1 = (1, −1); for λ2 = 5 the eigenvector is v2 = (1, 1). Normalizing the eigenvectors, the spectral decomposition of A is
A = Q Λ Q^T = (√2/2 √2/2; −√2/2 √2/2) (1 0; 0 5) (√2/2 −√2/2; √2/2 √2/2).
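
The same decomposition can be checked numerically (assuming NumPy is available):

import numpy as np

A = np.array([[3.0, 2.0], [2.0, 3.0]])
eigvals, Q = np.linalg.eigh(A)          # eigh: ascending eigenvalues, orthonormal eigenvectors
Lam = np.diag(eigvals)

print(eigvals)                          # [1. 5.]
print(np.allclose(Q @ Lam @ Q.T, A))    # True: A = Q Λ Q^T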

9. What is kernel PCA?


In the field of multivariate statistics, kernel principal component analysis (kernel PCA) is
an extension of principal component analysis (PCA) using techniques of kernel methods. Using a
kernel, the originally linear operations of PCA are performed in a reproducing kernel Hilbert space.

10. What is factor analysis?


The idea of factor analysis is to ask whether the data that is observed can be explained by a smaller number of uncorrelated factors or latent variables.

11. What is genetic algorithm?


The genetic algorithm is a method for solving both constrained and unconstrained
optimization problems that is based on natural selection, the process that drives biological
evolution. The genetic algorithm repeatedly modifies a population of individual solutions. At each
step, the genetic algorithm selects individuals at random from the current population to be parents
and uses them to produce the children for the next generation. Over successive generations, the
population "evolves" toward an optimal solution.

12. What are three main types of rules the genetic algorithm uses at each step to create the
next generation from the current population?
The genetic algorithm uses three main types of rules at each step to create the next generation
from the current population:
• Selection rules select the individuals, called parents that contribute to the population at the
next generation.
• Crossover rules combine two parents to form children for the next generation.
• Mutation rules apply random changes to individual parents to form children.

13. Enumerate the difference between genetic algorithm and classical optimization
algorithm.
The genetic algorithm differs from a classical, derivative-based optimization algorithm in two main ways, summarized below.

Classical algorithm:
• Generates a single point at each iteration; the sequence of points approaches an optimal solution.
• Selects the next point in the sequence by a deterministic computation.

Genetic algorithm:
• Generates a population of points at each iteration; the best point in the population approaches an optimal solution.
• Selects the next population by computation which uses random number generators.

14. List the methods of various selection of chromosomes.


• Fitness proportionate selection: the individual is selected on the basis of fitness; the probability of an individual being selected increases the greater its fitness is relative to its competitors’.
• Simplex Crossover (SPX)
• Boltzmann selection
• Tournament selection
• Rank selection
• Steady state selection
• Truncation selection
• Local selection

15. What is truncation selection?


Truncation selection is the simplest and arguably least useful selection strategy. Truncation
selection simply retains the fittest x% of the population. These fittest individuals are duplicated so
that the population size is maintained. For example, we might select the fittest 25% from a
population of 100 individuals. In this case we would create four copies of each of the 25 candidates
in order to maintain a population of 100 individuals. This is an easy selection strategy to implement
but it can result in premature convergence as less fit candidates are ruthlessly culled without being
given the opportunity to evolve into something better. Nevertheless, truncation selection can be an
effective strategy for certain problems.

16. What is fitness-proportionate selection?


A better approach to selection is to give every individual a chance of being selected to
breed but to make fitter candidates more likely to be chosen than weaker individuals. This is
achieved by making an individual's survival probability a function of its fitness score. Such
strategies are known as fitness-proportionate selection.

17. What are the three basic tasks to be performed when you want to apply genetic
algorithm?
• Encode possible solutions as strings
• Choose a suitable fitness functions
• Choose suitable genetic operators.

18. What is crossover?


The crossover operator is analogous to reproduction and biological crossover. In this, more than one parent is selected and one or more offspring are produced using the genetic material of the parents. Crossover is usually applied in a GA with a high probability.
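
A small sketch of single-point crossover and bit-flip mutation on binary-string chromosomes (an illustrative encoding, not a complete GA):

import random

def crossover(parent1, parent2):
    # Single-point crossover: swap the tails of two parents at a random cut point.
    point = random.randint(1, len(parent1) - 1)
    return parent1[:point] + parent2[point:], parent2[:point] + parent1[point:]

def mutate(chromosome, rate=0.01):
    # Bit-flip mutation: each gene is flipped independently with a small probability.
    return [1 - g if random.random() < rate else g for g in chromosome]

p1, p2 = [0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1]
c1, c2 = crossover(p1, p2)
print(c1, c2)
print(mutate(c1, rate=0.5))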

19. What is Markov decision processes (MDPs)


Markov decision processes provide a mathematical framework for modeling decision
making in situations where outcomes are partly random and partly under the control of a decision

maker. MDPs are useful for studying a wide range of optimization problems solved via dynamic
programming and reinforcement learning.

20. What is partially observable Markov decision process (POMDP)?


A partially observable Markov decision process (POMDP) is a generalization of a Markov
decision process (MDP). A POMDP models an agent decision process in which it is assumed that
the system dynamics are determined by an MDP, but the agent cannot directly observe the
underlying state. Instead, it must maintain a probability distribution over the set of possible states,
based on a set of observations and observation probabilities, and the underlying MDP.

21. What two requirements should a problem satisfy in order to be suitable for solving it by
a Genetic Algorithm (GA)?
A GA can only be applied to problems that satisfy the following requirements: the fitness function can be well-defined, and solutions should be decomposable into steps (building blocks) which can then be encoded as chromosomes.

22. Explain the Confusion Matrix with Respect to Machine Learning Algorithms.
A confusion matrix (or error matrix) is a specific table that is used to measure the performance of
an algorithm. It is mostly used in supervised learning; in unsupervised learning, it’s called the
matching matrix. The confusion matrix has two parameters:
• Actual
• Predicted
It also has identical sets of features in both of these dimensions.

23. What is Deep Learning?


Deep learning is a subset of machine learning that involves systems that think and learn like
humans using artificial neural networks. The term ‘deep’ comes from the fact that you can have
several layers of neural networks.

One of the primary differences between machine learning and deep learning is that feature
engineering is done manually in machine learning. In the case of deep learning, the model
consisting of neural networks will automatically determine which features to use (and which not
to use).

24. How to formulate a basic reinforcement Learning problem?


Some key terms that describe the elements of a RL problem are:
Environment: Physical world in which the agent operates
State: Current situation of the agent
Reward: Feedback from the environment
Policy: Method to map agent’s state to actions
Value: Future reward that an agent would receive by taking an action in a particular state

25. What are some most used Reinforcement Learning algorithms?


Q-learning and SARSA (State-Action-Reward-State-Action) are two commonly used model-free RL algorithms. They differ in their exploration strategies while their exploitation strategies are similar. Q-learning is an off-policy method in which the agent learns the value based on an action derived from another (greedy) policy, whereas SARSA is an on-policy method in which it learns the value based on its current action, derived from its current policy.

PART-B
1. Explain the three different ways to do dimensionality reduction in detail.
2. Explain Linear Discriminant analysis in detail.
3. Explain principal component analysis algorithm in detail.
4. Explain Factor analysis in detail.
5. Use the LDA on the Iris dataset. Compare the results with using PCA, which is not supervised
and will not therefore be able to find the same space.
6. Explain Levenberg-Marquardt algorithm in detail.
7. Explain Conjugate gradients algorithm in detail.
8. Explain the three basic approaches in search techniques.
9. Explain evolutionary learning in detail.
10. Explain reinforcement learning in detail

UNIT-V GRAPHICAL MODELS


PART-A

1. What is linear congruential generator (LCG)?


A linear congruential generator (LCG) is an algorithm that yields a sequence of pseudo-randomized numbers calculated with a discontinuous piecewise linear equation. The method represents one of the oldest and best-known pseudorandom number generator algorithms. The theory behind them is relatively easy to understand, and they are easily implemented and fast, especially on computer hardware which can provide modulo arithmetic by storage-bit truncation.
The generator is defined by the recurrence relation
x_{n+1} = (a·x_n + c) mod m,
where a, c and m are parameters that have to be chosen.
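
A minimal LCG sketch in Python; the default constants used here are commonly quoted values and are only one possible choice of a, c and m:

def lcg(seed, a=1664525, c=1013904223, m=2**32):
    # x_{n+1} = (a * x_n + c) mod m, scaled to [0, 1)
    x = seed
    while True:
        x = (a * x + c) % m
        yield x / m

gen = lcg(seed=42)
print([round(next(gen), 4) for _ in range(5)])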

2. What is Mersenne prime?


In mathematics, a Mersenne prime is a prime number that is one less than a power of two. That is, it is a prime number of the form Mn = 2^n − 1 for some integer n.

3. What is Box-Muller transform?


The Box–Muller transform, by George Edward Pelham Box and Mervin Edgar Muller, is a
pseudo-random number sampling method for generating pairs of independent, standard, normally
distributed (zero expectation, unit variance) random numbers, given a source of uniformly
distributed random numbers.

4. What are the two forms that the Box–Muller transform is commonly expressed?
The Box–Muller transform is commonly expressed in two forms. The basic form as given
by Box and Muller takes two samples from the uniform distribution on the interval [0, 1] and maps
them to two standard, normally distributed samples. The polar form takes two samples from a
different interval, [−1, +1], and maps them to two normally distributed samples without the use of
sine or cosine functions.
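
A sketch of the basic form of the transform (assuming NumPy is available):

import numpy as np

def box_muller(n, rng):
    # Basic form: two uniform samples on (0, 1] -> two independent standard normal samples.
    u1 = 1.0 - rng.uniform(size=n)   # shift to (0, 1] so the log is always defined
    u2 = rng.uniform(size=n)
    r = np.sqrt(-2.0 * np.log(u1))
    return r * np.cos(2.0 * np.pi * u2), r * np.sin(2.0 * np.pi * u2)

z0, z1 = box_muller(100000, np.random.default_rng(0))
print(round(z0.mean(), 3), round(z0.std(), 3))   # close to 0 and 1
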
5. What is chain?
In probabilistic terms, a chain is a sequence of possible states, where the probability of being in state s at time t is a function of the previous states.

6. What is Markov chain?

A Markov chain is a chain with the Markov property: the probability of the state at time t depends only on the state at time t−1.

7. What is graphical model?


A graphical model or probabilistic graphical model (PGM) is a probabilistic model for
which a graph expresses the conditional dependence structure between random variables. They are
commonly used in probability theory, statistics—particularly Bayesian statistics—and machine
learning.

8. What is Metropolis–Hastings Algorithm?


The Metropolis–Hastings algorithm is a Markov chain Monte Carlo (MCMC) method for
obtaining a sequence of random samples from a probability distribution for which direct sampling
is difficult. This sequence can be used to approximate the distribution (e.g., to generate a
histogram), or to compute an integral (such as an expected value). Metropolis–Hastings and other
MCMC algorithms are generally used for sampling from multi-dimensional distributions,
especially when the number of dimensions is high
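
A minimal sketch (assuming NumPy is available) that samples from an unnormalized one-dimensional target with a symmetric Gaussian random-walk proposal, so the Hastings correction term cancels:

import numpy as np

def metropolis_hastings(log_target, n_samples, x0=0.0, step=1.0, seed=0):
    # Random-walk proposal; accept the move with probability min(1, p(x') / p(x)).
    rng = np.random.default_rng(seed)
    samples = np.empty(n_samples)
    x = x0
    for i in range(n_samples):
        x_new = x + rng.normal(scale=step)            # proposal q(x' | x)
        if np.log(rng.uniform()) < log_target(x_new) - log_target(x):
            x = x_new                                 # accept; otherwise keep the current state
        samples[i] = x
    return samples

log_target = lambda x: -0.5 * (x - 3.0) ** 2          # unnormalized N(3, 1)
s = metropolis_hastings(log_target, 20000)
print(round(s[5000:].mean(), 2))                      # close to 3 after discarding burn-in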

9. What is Gibbs sampling?


Gibbs sampling is one MCMC technique suitable for the task. The idea in Gibbs sampling
is to generate posterior samples by sweeping through each variable (or block of variables) to
sample from its conditional distribution with the remaining variables fixed to their current values.

10. What is Transition Probabilities?


The one-step transition probability is the probability of transitioning from one state to
another in a single step. The Markov chain is said to be time homogeneous if the transition
probabilities from one state to another are independent of time index.

11. What is ergodic in Markov chain?


A Markov chain is called an ergodic chain if it is possible to go from every state to every state (not necessarily in one move). Ergodic Markov chains are also called irreducible. A Markov chain is called a regular chain if some power of the transition matrix has only positive elements.

12. What is Markov chain Monte Carlo?


Markov chain Monte Carlo (MCMC) methods are a class of algorithms for sampling from
a probability distribution based on constructing a Markov chain that has the desired distribution as
its equilibrium distribution. The state of the chain after a number of steps is then used as a sample
of the desired distribution. The quality of the sample improves as a function of the number of steps.
13. What is simulated annealing?
Simulated annealing is a method for solving unconstrained and bound-constrained
optimization problems. The method models the physical process of heating a material and then
slowly lowering the temperature to decrease defects, thus minimizing the system energy.

14. What are the two types of graphical models?


Generally, probabilistic graphical models use a graph-based representation as the
foundation for encoding a complete distribution over a multi-dimensional space and a graph that
is a compact or factorized representation of a set of independences that hold in the specific
distribution. Two branches of graphical representations of distributions are commonly used,
namely, Bayesian networks and Markov random fields. Both families encompass the properties of

factorization and independences, but they differ in the set of independences they can encode and
the factorization of the distribution that they induce.

15. What is Markov Random Fields.?


A Markov Random Field (MRF) is a graphical model of a joint probability distribution. It consists of an undirected graph in which the nodes represent random variables.

16. What is a hidden Markov model?

A Hidden Markov Model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a Markov process with unobserved (i.e. hidden) states. The hidden Markov model can be represented as the simplest dynamic Bayesian network.

17. What is HMM Viterbi algorithm?


The Viterbi algorithm is a dynamic programming algorithm for finding the most likely sequence of hidden states – called the Viterbi path – that results in a sequence of observed events, especially in the context of Markov information sources and hidden Markov models.
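
A compact Viterbi sketch (assuming NumPy is available); the initial, transition and emission probabilities are illustrative placeholders:

import numpy as np

def viterbi(obs, pi, A, B):
    # pi: initial state probs (N,), A: transition probs (N, N), B: emission probs (N, M).
    N, T = len(pi), len(obs)
    delta = np.zeros((T, N))               # best path probability ending in each state
    psi = np.zeros((T, N), dtype=int)      # back-pointers to the best previous state
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        for j in range(N):
            scores = delta[t - 1] * A[:, j]
            psi[t, j] = np.argmax(scores)
            delta[t, j] = scores[psi[t, j]] * B[j, obs[t]]
    # Backtrack the most likely state sequence.
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.5], [0.1, 0.9]])
print(viterbi([0, 1, 1], pi, A, B))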

18. What are the two methods of tracking?


• Kalman filter
• Particle filter

19. What is Kalman filter?


A Kalman filter is an optimal estimator, i.e. it infers parameters of interest from indirect, inaccurate and uncertain observations. It is recursive, so that new measurements can be processed as they arrive (as opposed to batch processing, where all data must be present).
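
A one-dimensional Kalman filter sketch (assuming NumPy is available) that estimates a constant signal from noisy measurements; the noise settings are illustrative:

import numpy as np

def kalman_1d(measurements, process_var=1e-4, meas_var=0.5):
    # Scalar Kalman filter for a (nearly) constant state: predict, then correct with each measurement.
    x, P = 0.0, 1.0                    # initial state estimate and its variance
    estimates = []
    for z in measurements:
        P = P + process_var            # predict step (state model: x_t = x_{t-1} + noise)
        K = P / (P + meas_var)         # Kalman gain
        x = x + K * (z - x)            # correct the estimate with the new measurement
        P = (1 - K) * P
        estimates.append(x)
    return estimates

rng = np.random.default_rng(0)
z = 5.0 + rng.normal(0, np.sqrt(0.5), size=50)   # noisy readings of the true value 5.0
print(round(kalman_1d(z)[-1], 2))                # close to 5.0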

20. What is particle filters?


Particle filters or Sequential Monte Carlo (SMC) methods are a set of genetic, Monte Carlo
algorithms used to solve filtering problems arising in signal processing and Bayesian statistical
inference. The filtering problem consists of estimating the internal states in dynamical systems
when partial observations are made, and random perturbations are present in the sensors as well as
in the dynamical system.

21. Define Bayesian network. (Jan 2018)


A Bayesian network (also known as a Bayes network, belief network, or decision network) is a
probabilistic graphical model that represents a set of variables and their conditional dependencies
via a directed acyclic graph (DAG)

22. Define Gibbs Sampling Algorithm.


The Gibbs Sampling algorithm is an approach to constructing a Markov chain where the probability of the next sample is calculated as the conditional probability given the prior sample. Samples are constructed by changing one random variable at a time, meaning that subsequent samples are very close in the search space, i.e. local. As such, there is some risk of the chain getting stuck.

23. What is Metropolis-Hastings Algorithm?


The Metropolis-Hastings Algorithm is appropriate for those probabilistic models where we cannot
directly sample the so-called next state probability distribution, such as the conditional probability
distribution used by Gibbs Sampling.

24. What is the complexity of Viterbi algorithm?


The Viterbi algorithm is a dynamic programming approach to find the most probable sequence of hidden states given the observed data, as modeled by an HMM. Without dynamic programming it becomes an exponential problem, as there are an exponential number of possible state sequences for a given observation; with dynamic programming the complexity is O(N²T) for N hidden states and T observations.

25. What order of Markov assumption does n-grams model make?


An n-gram model makes an order n−1 Markov assumption. This assumption implies that, given the previous n−1 words, the probability of a word is independent of the words before those n−1 words.

PART-B

1. Explain sampling in MCMC.


2. Explain Monte Carlo or Bust in detail.
3. Explain the proposal distribution in detail.
4. Explain Metropolis-Hastings in detail.
5 Explain Bayesian network in detail.
6. Explain Markov random field in detail.
7. Explain HMM-Viterbi algorithm in detail.
8. Explain HMM Baum-Welch (Forward-Backward) algorithm in detail.
9. Explain the particle filter in detail
10. Explain Kalman filter algorithm in detail.
