AIML FML Answer Key

Subject / Code: Fundamentals of Machine Learning/ AI PC502

Year/Semester/Section: III/ V
Dept: AIML Max. Marks: 75
[Bloom’s Taxonomy: K1- Remember, K2- Understand, K3-Apply,
K4-Analysis]
PART – A
Answer all the questions (10*2 = 20)
1. How does a perceptron work? (CO1, K1)
A perceptron is a single-layer, linear neural network, i.e., a supervised machine learning algorithm used to learn binary classifiers. It works as an artificial neuron: it computes a weighted sum of the input features, adds a bias, and passes the result through a step (threshold) activation to produce a binary output. The weights are adjusted whenever a training example is misclassified.
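
A minimal sketch of the perceptron learning rule, assuming NumPy; the data and function names below are illustrative only:

import numpy as np

def train_perceptron(X, y, epochs=20, lr=1.0):
    """Learn weights w and bias b for a binary classifier with labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            # Step activation: predict sign(w.x + b); update only on a mistake
            if yi * (np.dot(w, xi) + b) <= 0:
                w += lr * yi * xi
                b += lr * yi
    return w, b

# Illustrative linearly separable data (AND-like problem)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, -1, -1, 1])
w, b = train_perceptron(X, y)
print(np.sign(X @ w + b))   # prints [-1. -1. -1.  1.]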

2. Illustrate how the linear regression model provides a sloped straight line representing the relationship between the variables. (CO1, K1)

3. Write the formula for the KNN rule. (CO2, K4)


Denote the set of the k nearest neighbours of x by S_x. Formally, S_x ⊆ D such that |S_x| = k and ∀(x′, y′) ∈ D \ S_x: dist(x, x′) ≥ max_{(x″, y″) ∈ S_x} dist(x, x″), i.e. every point in D that is not in S_x is at least as far from x as the farthest point in S_x.
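
A minimal sketch of the k-NN rule described above, assuming NumPy, Euclidean distance, and majority voting over S_x; the data below is illustrative:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Return the majority label among the k nearest neighbours S_x of the query point x."""
    dists = np.linalg.norm(X_train - x, axis=1)   # dist(x, x') for every training point
    nearest = np.argsort(dists)[:k]               # indices of S_x
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Illustrative data: two small clusters with labels 0 and 1
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.1, 0.0]), k=3))   # -> 0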

4. Differentiate classification and regression. (CO2, K1)


5. How does Independent Component Analysis work in a neural network? (CO3, K2)

6. Difference between CNN and RNN. (CO3, K2)

The main differences between CNNs and RNNs include the following: CNNs are commonly used to solve problems involving spatial data, such as images, whereas RNNs are better suited to analysing temporal and sequential data, such as text or video. CNNs and RNNs also have different architectures: a CNN uses convolution and pooling layers, while an RNN maintains a hidden state that is carried across time steps.
7. Define ensemble method. (CO4, K2)
Ensemble methods are techniques that aim to improve the accuracy of results by combining multiple models instead of using a single model. The combined models typically increase the accuracy of the results significantly.

8. Label the parts of a convolutional neural network with a neat diagram. (CO4, K1)

9. What is the difference between bagging and boosting in machine learning? (CO5,K2)
10. List the purpose of cascading. (CO5, K2)

Cascading chains models in stages so that the output of one stage feeds the next. For example, a cascade neural network can be used to detect and isolate three types of faults; eight machine learning classifiers are compared in the task of fault isolation, real irradiance profiles with clouds are used to train the classifiers, and the classifiers are fed with present and past information to increase accuracy.

PART – B
Answer all the questions (5*11 = 55)
UNIT I
0. a) Explain the Classifier, Discriminant Function and decision Surface in detail. (11)

Classifiers:

A classifier is a statistical function that assigns a data point to one of a predefined set of classes. Classifiers are used
in a wide variety of applications, including machine learning, pattern recognition, and image analysis.
There are many different types of classifiers, but they all share the same basic goal: to accurately predict the class
of a new data point. Some common types of classifiers include:

● Linear classifiers: These classifiers make predictions based on a linear combination of the input features.
● Nonlinear classifiers: These classifiers can make predictions on data that is not linearly separable.
● Probabilistic classifiers: These classifiers assign a probability to each class, indicating the likelihood that
the data point belongs to that class.

Discriminant Functions:
A discriminant function assigns a score to a data point for each class; the classifier predicts the class whose discriminant value is highest.
The discriminant function for a particular class is typically defined as the log of the posterior probability of the class given the data point. This means that the discriminant function is high for data points that are likely to belong to the class and low for data points that are unlikely to belong to it.

Decision Surfaces:
A decision surface is a boundary in the input space that separates one class from another. The decision surface is
determined by the discriminant function.
Data points on one side of the decision surface are predicted to belong to one class, while data points on the other
side of the decision surface are predicted to belong to the other class.
The shape of the decision surface depends on the type of classifier and the input data. For example, the decision
surface for a linear classifier is a hyperplane, while the decision surface for a nonlinear classifier can be more
complex.
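
A minimal sketch of linear discriminant functions and the decision surface they induce, assuming NumPy; the weight values are illustrative, not learned from data:

import numpy as np

# One linear discriminant g_i(x) = w_i . x + b_i per class (illustrative values)
W = np.array([[1.0, -1.0],     # class 0
              [-1.0, 1.0]])    # class 1
b = np.array([0.0, 0.0])

def classify(x):
    """Predict the class whose discriminant value is largest."""
    scores = W @ x + b
    return int(np.argmax(scores))

# The decision surface is where the discriminants are equal: g_0(x) = g_1(x),
# i.e. (w_0 - w_1) . x + (b_0 - b_1) = 0 -- a hyperplane for linear classifiers.
print(classify(np.array([2.0, 0.5])))   # -> 0, since the point lies on the x1 > x2 side of the surface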
0. b) Discuss the Bayesian Belief Network concept. (11) (CO1, K4)
A Bayesian belief network is a key technique for dealing with probabilistic events and for solving problems that involve uncertainty. We can define a Bayesian network as:

"A Bayesian network is a probabilistic graphical model which represents a set of variables and their conditional
dependencies using a directed acyclic graph."

It is also called a Bayes network, belief network, decision network, or Bayesian model.

Bayesian networks are probabilistic, because these networks are built from a probability distribution, and also use
probability theory for prediction and anomaly detection.

Real world applications are probabilistic in nature, and to represent the relationship between multiple events, we need a
Bayesian network. It can also be used in various tasks including prediction, anomaly detection, diagnostics,
automated insight, reasoning, time series prediction, and decision making under uncertainty.

A Bayesian network can be used for building models from data and expert opinions, and it consists of two parts:

o Directed Acyclic Graph

o Table of conditional probabilities.

The generalized form of a Bayesian network that represents and solves decision problems under uncertain knowledge is known as an Influence Diagram.

A Bayesian network graph is made up of nodes and arcs (directed links), where:

o Each node corresponds to a random variable, and a variable can be continuous or discrete.
o Arcs (directed arrows) represent the causal relationships or conditional probabilities between random variables. These directed links connect pairs of nodes in the graph. A link indicates that one node directly influences the other; if there is no directed link between two nodes, they are independent of each other.
o In the above diagram, A, B, C, and D are random variables represented by the nodes of the network graph.
o If we consider node B, which is connected to node A by a directed arrow, then node A is called the parent of node B.
o Node C is independent of node A.

The Bayesian network graph does not contain any cyclic graph. Hence, it is known as a directed acyclic graph or DAG.
The Bayesian network has mainly two components:
o Causal Component
o Actual numbers

Each node in the Bayesian network has a conditional probability distribution P(Xi | Parent(Xi)), which determines the effect of the parents on that node.

Explanation of a Bayesian network:

Let us explain the Bayesian network through an example by creating a directed acyclic graph:

Calculate the probability that the alarm has sounded, but neither a burglary nor an earthquake has occurred, and both David and Sophia have called Harry.

Solution:

o The Bayesian network for the above problem is given below. The network structure shows that burglary and earthquake are the parent nodes of the alarm and directly affect the probability of the alarm going off, whereas David's and Sophia's calls depend only on the alarm.
o The network represents our assumptions: the callers do not directly perceive the burglary, do not notice a minor earthquake, and do not confer with each other before calling.
o The conditional distribution for each node is given as a conditional probability table, or CPT.
o Each row in a CPT must sum to 1, because the entries in a row represent an exhaustive set of cases for the variable.
o In a CPT, a Boolean variable with k Boolean parents contains 2^k probabilities. Hence, if there are two parents, the CPT will contain 4 probability values.

List of all events occurring in this network:


o Burglary (B)
o Earthquake(E)
o Alarm(A)
o David Calls(D)
o Sophia calls(S)

We can write the events of the problem statement in the form of the probability P[D, S, A, B, E], and can rewrite this using the joint probability distribution:

P[D, S, A, B, E] = P[D | S, A, B, E] · P[S, A, B, E]
                 = P[D | S, A, B, E] · P[S | A, B, E] · P[A, B, E]
                 = P[D | A] · P[S | A, B, E] · P[A, B, E]
                 = P[D | A] · P[S | A] · P[A | B, E] · P[B, E]
                 = P[D | A] · P[S | A] · P[A | B, E] · P[B | E] · P[E]

Let's take the observed probability for the Burglary and earthquake component:

P(B= True) = 0.002, which is the probability of burglary.


P(B= False)= 0.998, which is the probability of no burglary.
P(E= True)= 0.001, which is the probability of a minor earthquake
P(E= False)= 0.999, Which is the probability that an earthquake not occurred.
We can provide the conditional probabilities as per the below tables:

Conditional probability table for Alarm A:

The conditional probability of Alarm A depends on Burglary and Earthquake:

B       E       P(A=True)   P(A=False)
True    True    0.94        0.06
True    False   0.95        0.05
False   True    0.31        0.69
False   False   0.001       0.999

Conditional probability table for David Calls:

The conditional probability that David calls depends on the probability of the Alarm:

A       P(D=True)   P(D=False)
True    0.91        0.09
False   0.05        0.95

Conditional probability table for Sophia Calls:

The conditional probability that Sophia calls depends on its parent node, the Alarm:

A       P(S=True)   P(S=False)
True    0.75        0.25
False   0.02        0.98

From the formula of the joint distribution, we can write the problem statement in the form of a probability distribution:

P(S, D, A, ¬B, ¬E) = P(S | A) · P(D | A) · P(A | ¬B ∧ ¬E) · P(¬B) · P(¬E)
                   = 0.75 × 0.91 × 0.001 × 0.998 × 0.999
                   = 0.00068045
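
A minimal sketch that reproduces this calculation from the CPTs above, using plain Python dictionaries; the hard-coded structure is specific to this five-variable example:

# CPTs from the tables above
P_B = {True: 0.002, False: 0.998}
P_E = {True: 0.001, False: 0.999}
P_A = {(True, True): 0.94, (True, False): 0.95,
       (False, True): 0.31, (False, False): 0.001}   # P(A=True | B, E)
P_D = {True: 0.91, False: 0.05}                      # P(D=True | A)
P_S = {True: 0.75, False: 0.02}                      # P(S=True | A)

def joint(d, s, a, b, e):
    """P(D, S, A, B, E) = P(D|A) * P(S|A) * P(A|B,E) * P(B) * P(E)."""
    p_a = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    p_d = P_D[a] if d else 1 - P_D[a]
    p_s = P_S[a] if s else 1 - P_S[a]
    return p_d * p_s * p_a * P_B[b] * P_E[e]

# Alarm sounded, David and Sophia called, no burglary, no earthquake
print(joint(d=True, s=True, a=True, b=False, e=False))   # ~0.00068045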

UNIT II

0. a) Illustrate the concept of Support Vector Machine with examples. (11) (CO2,K2)

Support Vector Machine, or SVM, is one of the most popular supervised learning algorithms and is used for both classification and regression problems. However, it is primarily used for classification problems in machine learning.

The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space into classes, so that a new data point can easily be put in the correct category in the future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed a Support Vector Machine. Consider a diagram in which two different categories are separated by a decision boundary, or hyperplane.
SVM algorithm can be used for Face detection, image classification, text categorization, etc.
Types of SVM

SVM can be of two types:

o Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset can be classified into two classes using a single straight line, such data is termed linearly separable data, and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which means that if a dataset cannot be classified using a straight line, such data is termed non-linear data, and the classifier used is called a Non-linear SVM classifier.

Hyperplane and Support Vectors in the SVM algorithm:

Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in n-dimensional space, but we
need to find out the best decision boundary that helps to classify the data points. This best boundary is known as the
hyperplane of SVM.

The dimension of the hyperplane depends on the number of features in the dataset: if there are 2 features, the hyperplane is a straight line, and if there are 3 features, the hyperplane is a two-dimensional plane.

We always create the hyperplane that has the maximum margin, i.e. the maximum distance between the hyperplane and the nearest data points of each class.

Support Vectors:

The data points or vectors that are closest to the hyperplane and that affect its position are termed support vectors. Since these vectors support the hyperplane, they are called support vectors.

Advantages of SVM:

● Effective in high-dimensional cases.
● Memory-efficient, since it uses only a subset of the training points (the support vectors) in the decision function.
● Different kernel functions can be specified for the decision function, and it is possible to specify custom kernels.
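
A minimal sketch of a linear SVM on synthetic data, assuming scikit-learn is available; the dataset and parameters are illustrative only:

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two linearly separable clusters -> a Linear SVM is enough
X, y = make_blobs(n_samples=100, centers=2, random_state=0)
clf = SVC(kernel="linear", C=1.0)        # kernel="rbf" would give a non-linear SVM
clf.fit(X, y)

print("Support vectors:", len(clf.support_vectors_))   # the extreme points defining the hyperplane
print("Prediction:", clf.predict(X[:2]))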

14. (a) How is a random forest tree constructed? (11) (CO2, K2)

Constructing a random forest tree involves several steps:

1. Data Preparation:

● Before constructing the random forest, it's essential to prepare the data appropriately. This includes handling
missing values, converting categorical features into numerical representations, and normalizing numerical
features if necessary.

2. Bootstrap Sampling:

● Random forest employs bootstrap sampling to introduce randomness and reduce overfitting. In bootstrap
sampling, a random subset of the training data is selected with replacement, meaning some data points may
appear multiple times while others may not appear at all.

3. Feature Subset Selection:

● For each bootstrap sample, a random subset of features is selected for building the decision tree. This process,
known as feature subset selection, helps prevent overfitting by reducing the number of features considered for
each tree.
4. Decision Tree Construction:

● Using the bootstrap sample and randomly selected features, a decision tree is constructed. The construction
process involves recursively splitting the data into smaller and smaller subsets based on the chosen features and
their values.

5. Splitting Criteria:

● At each split, a splitting criterion is used to determine the best feature and its value for separating the data.
Common splitting criteria include information gain, Gini index, and chi-squared test.

6. Pruning:

● To prevent overfitting, pruning is applied to the decision tree. Pruning involves removing nodes that contribute
little to the overall predictive performance of the tree.

7. Repeat Steps 2-6:

● Steps 2-6 are repeated for a specified number of trees, typically hundreds or thousands. Each tree is constructed
independently using different bootstrap samples and random feature subsets.

8. Aggregation:

● After constructing all the decision trees, their predictions are aggregated to determine the final prediction for a
new data point. For classification tasks, majority voting is often used, where the class predicted by the majority
of trees becomes the final prediction. For regression tasks, averaging the predictions from all trees is common.

The resulting ensemble of decision trees, known as a random forest, is a powerful machine learning model that can
handle complex relationships and provide robust predictions.

These are the steps used to create a random forest.
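
A minimal sketch of a random forest using scikit-learn's RandomForestClassifier, which performs the bootstrap sampling, feature-subset selection, and majority-vote aggregation described above internally; the synthetic dataset is illustrative only:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# n_estimators = number of trees (steps 2-6 repeated); max_features = feature subset size per split
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)
forest.fit(X_tr, y_tr)
print("Test accuracy:", forest.score(X_te, y_te))   # aggregated (majority-vote) prediction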

UNIT III

0. a) Illustrate the different methods of Principal Component Analysis with examples


Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms a high-dimensional
dataset into a lower-dimensional representation while preserving as much of the original information as possible.
PCA is widely used in various fields, including data analysis, machine learning, and pattern recognition.

Methods of Principal Component Analysis


There are several methods for performing PCA, but the most common approach involves the following steps:

● Data Standardization: Normalize the data by centering each feature and scaling it to have a unit variance. This
ensures that all features are on the same scale and have equal influence on the analysis.

● Covariance Matrix Computation: Calculate the covariance matrix of the standardized data. The covariance
matrix represents the relationships between the different features.
● Eigenvalue Decomposition: Perform eigenvalue decomposition on the covariance matrix. This process
identifies the principal components, which are the directions of maximum variance in the data.

● Projection onto Principal Components: Project the data onto the selected principal components. This involves
selecting the k principal components with the largest eigenvalues and projecting the data onto these k directions.
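
A minimal NumPy sketch of these four steps, using a random matrix as a stand-in for a real dataset (illustrative only):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))                 # stand-in for e.g. the 150x4 Iris measurements

# 1. Standardize: zero mean, unit variance per feature
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the standardized data
C = np.cov(Xs, rowvar=False)

# 3. Eigen-decomposition: columns of V are the principal components
eigvals, V = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]             # sort by decreasing variance
eigvals, V = eigvals[order], V[:, order]

# 4. Project onto the k components with the largest eigenvalues
k = 2
Z = Xs @ V[:, :k]
print(Z.shape)                                # (150, 2)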

Examples of Principal Component Analysis

1. Iris Flower Dataset:

The Iris flower dataset consists of measurements of the sepal length, sepal width, petal length, and petal width of
150 Iris flower specimens from three different species. PCA can be applied to this dataset to reduce the
dimensionality from four features to two principal components, which can then be used to visualize the distribution
of the three species in two-dimensional space.

2. Face Recognition: In face recognition systems, PCA can be used to reduce the dimensionality of high-
resolution facial images, making it computationally efficient to compare and identify faces. By projecting facial
images onto the principal components, the essential features of the faces can be extracted while discarding
redundant information.

3. Market Basket Analysis: In market basket analysis, PCA can be applied to a dataset of customer purchases to
identify patterns in buying behavior. By reducing the dimensionality of the purchase data, PCA can reveal the
underlying relationships between products and customer preferences.

These examples demonstrate the versatility of PCA in various domains, where dimensionality reduction is essential
for efficient data analysis and pattern recognition. PCA provides a powerful tool for extracting meaningful
information from high-dimensional data while reducing computational
complexity.
OR

0. a) Formulate the EM algorithm and explain how Gaussian Mixture Models work. (11) (CO3, K3)

Expectation-Maximization (EM) Algorithm

The Expectation-Maximization (EM) algorithm is a statistical iterative algorithm used to estimate the parameters of
a statistical model. It is particularly useful for models that have latent variables, which are variables that are not
directly observed but are instead inferred from the observed data.
The EM algorithm works by iterating between two steps:

1. Expectation step: In the expectation step, the algorithm calculates the expected values of the latent
variables given the current parameter estimates.
2. Maximization step: In the maximization step, the algorithm maximizes the expected likelihood function
with respect to the parameters, given the expected values of the latent variables.
The EM algorithm is guaranteed to converge to a local maximum of the likelihood function, but it may not find the
global maximum.

Gaussian Mixture Models (GMMs)


A Gaussian mixture model (GMM) is a probabilistic model that assumes that the data is generated from a mixture
of a finite number of Gaussian distributions. GMMs are widely used in a variety of applications, including clustering,
density estimation, and anomaly detection.
A GMM is defined by the following parameters:
● Number of components: The number of Gaussian distributions in the mixture.
● Means: The mean vectors of the Gaussian distributions.
● Covariances: The covariance matrices of the Gaussian distributions.
● Weights: The weights of the Gaussian distributions, which represent the proportion of data points that are
generated from each distribution.
The EM algorithm can be used to estimate the parameters of a GMM from a dataset. The expectation step involves calculating the posterior probability (responsibility) of each data point belonging to each Gaussian component; the maximization step then updates the means, covariances, and weights using these responsibilities.
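
A minimal sketch of fitting a GMM with EM, assuming scikit-learn's GaussianMixture (which runs the E and M steps internally); the two-cluster data is synthetic and illustrative:

import numpy as np
from sklearn.mixture import GaussianMixture

# Two illustrative Gaussian clusters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
               rng.normal(5.0, 1.0, size=(200, 2))])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm.fit(X)                              # EM iterations: E-step responsibilities, M-step parameter updates

print("Weights:", gmm.weights_)         # mixing proportions
print("Means:\n", gmm.means_)
print("Responsibilities:\n", gmm.predict_proba(X[:3]))   # E-step posteriors for the first three points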
Relationship between EM and GMMs:
The EM algorithm is a general algorithm that can be used to estimate the parameters of a wide variety of statistical
models. However, it is particularly well-suited for estimating the parameters of GMMs. This is because GMMs have
a latent variable, which is the component that each data point belongs to. The EM algorithm is able to infer the
values of this latent variable by iterating between the expectation and maximization steps.

Applications of Gaussian Mixture Models

● Clustering: GMMs can be used to cluster data points into a pre-specified number of clusters. Each cluster
is represented by a Gaussian distribution, and the data points are assigned to the cluster that corresponds to
the Gaussian distribution with the highest posterior probability.
● Density estimation: GMMs can be used to estimate the density of the data. This is useful for tasks such as
anomaly detection and outlier detection.
● Feature extraction: GMMs can be used to extract features from data. This is useful for tasks such as image
classification and speech recognition.

The EM algorithm and GMMs are powerful tools for statistical modeling. They are widely used in a variety of
applications and have proven to be effective in a wide range of tasks.

UNIT IV

0. a). (i) List down the applications of Convolutional Neural Network over Recurrent Neural Network.

(ii) Discuss the time series prediction method in detail. (11) (CO4, K3)

OR

0. a) Illustrate the image-to-text classification method with a suitable example.


Image to text classification is a type of machine learning task that involves automatically assigning a text
description to an image. It's a powerful tool for understanding and interpreting visual content, with applications in
various fields, including image retrieval, content moderation, and accessibility.

Example: Identifying Objects in Images


Imagine an image of a cat sitting on a chair. An image to text classification model trained on a large dataset of
images with corresponding text descriptions could analyze the image and generate a text description like "A fluffy
black cat perched on a wooden chair."

Process:

1. Image Preprocessing: The image is preprocessed to prepare it for analysis. This may involve resizing,
converting to grayscale, or normalizing the pixel values.
2. Feature Extraction: The model extracts features from the image, such as edges, shapes, and colors. These
features capture the essential visual information of the image.
3. Feature Representation: The extracted features are converted into a numerical representation that the model
can understand. This representation allows the model to learn relationships between the features and the
corresponding text descriptions.
4. Model Training: The model is trained on a dataset of images with corresponding text descriptions. During
training, the model learns to associate the extracted features with the correct text descriptions.
5. Image Classification: Once trained, the model can be used to classify new images. The model analyzes the
features of a new image and predicts the most likely text description based on the learned associations.
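
A minimal PyTorch sketch of this pipeline, using a random tensor as a stand-in for a preprocessed image and a small untrained CNN; the label list and architecture are purely illustrative, not a trained captioning model:

import torch
import torch.nn as nn

# Hypothetical text descriptions the classifier can emit
LABELS = ["a cat on a chair", "a dog on grass", "an empty room"]

# Toy CNN: feature extraction (conv/pool) followed by a classifier head
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(16, len(LABELS)),
)

# Stand-in for a preprocessed (resized, normalized) RGB image: shape (batch, channels, H, W)
image = torch.randn(1, 3, 64, 64)

with torch.no_grad():
    logits = model(image)                       # feature extraction + classification
    predicted = LABELS[int(logits.argmax(dim=1))]

print("Predicted description:", predicted)      # untrained, so the output is arbitrary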

Applications:

● Image Search and Retrieval: Image to text classification enables efficient search engines that can find images
based on textual queries.
● Content Moderation: Identifying and filtering inappropriate or harmful content in images is crucial for social
media platforms and online communities.
● Accessibility: Image to text descriptions can make visual content accessible to people with visual impairments.
● Robotic Vision: Image to text classification plays a role in enabling robots to understand and interact with the physical world.

UNIT V

0. Explain the concepts of bagging, boosting and cascading. (11) (CO5, K2)

Bagging, boosting, and cascading are all ensemble learning techniques that aim to improve the performance of
machine learning models by combining multiple models. These techniques work by generating multiple models,
each trained on a different subset of the data, and then aggregating the predictions of these models to make a final
prediction.

Bagging (Bootstrap Aggregation):


Bagging is a simple and effective ensemble method that works by creating multiple models from different bootstrap
samples of the training data. A bootstrap sample is a subset of the training data created by randomly selecting data
points with replacement. This means that some data points may appear multiple times in the bootstrap sample, while
others may not appear at all.
Bagging is particularly effective in reducing the variance of the model, which is a measure of how much the model's
predictions vary between different training datasets. By averaging the predictions of multiple models trained on
different bootstrap samples, bagging can reduce the overall variance and improve the model's generalization
performance.

Boosting:

Boosting is a more sophisticated ensemble method that builds models sequentially, focusing on the data points that
were misclassified by previous models. Each model in the sequence is trained on a weighted version of the training
data, where the weights are assigned based on the errors of the previous model. This means that the model pays
more attention to the data points that it is struggling with, which helps to improve the model's overall performance.
Boosting is particularly effective in reducing the bias of the model, which is a measure of how much the model's
predictions are shifted away from the true values. By focusing on the data points that were misclassified by previous
models, boosting can help to correct for the biases of these models and make the overall model more accurate.

Cascading:

Cascading is a less common ensemble method that combines multiple models in a hierarchical manner. The models
are arranged in a cascade, where the output of one model is used as the input for the next model. This allows the
models to specialize in different aspects of the problem, which can improve the overall performance of the ensemble.
Cascading is particularly effective in problems where there are multiple stages of classification or regression. For
example, a cascading model could be used to first classify an image as a cat or a dog, and then use this classification
as input to a second model that classifies the cat or dog as a specific breed.
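
A minimal sketch contrasting bagging and boosting with scikit-learn's ensemble classes (their default base estimators are decision trees); cascading has no single standard library class, so it is omitted here. The synthetic data is illustrative only:

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

bagging = BaggingClassifier(n_estimators=100, random_state=0)    # bootstrap samples, variance reduction
boosting = AdaBoostClassifier(n_estimators=100, random_state=0)  # sequential reweighting, bias reduction

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    model.fit(X_tr, y_tr)
    print(name, "accuracy:", round(model.score(X_te, y_te), 3))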

OR

20. Explain the concept of the Error Correction Output Code method and the ensemble fine-tuning method. (11) (CO5, K2)

Error Correction Output Code (ECOC) Method:

Error Correction Output Code (ECOC) is an ensemble learning technique that transforms a multi-class classification problem into multiple binary classification problems. This allows native binary classification models to be used directly for multi-class classification tasks.

ECOC works by encoding each class as a binary codeword. The binary codewords are designed to have error-
correcting properties, meaning that if a few of the binary classifiers make mistakes, the overall prediction will still
be correct.

The ECOC method is typically implemented by first training a set of binary classifiers on the training data. Then,
for each data point to be classified, the outputs of all of the binary classifiers are combined to determine the predicted
class.
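
A minimal sketch of ECOC using scikit-learn's OutputCodeClassifier, which generates the binary codewords and trains one binary classifier per code bit; the three-class synthetic data and parameters are illustrative:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OutputCodeClassifier

# Three-class problem decomposed into binary problems via a code matrix
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

ecoc = OutputCodeClassifier(estimator=LogisticRegression(max_iter=1000),
                            code_size=2.0, random_state=0)   # code_size * n_classes binary classifiers
ecoc.fit(X, y)
print("Number of binary classifiers:", len(ecoc.estimators_))
print("Predictions:", ecoc.predict(X[:5]))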

The ECOC method has several advantages, including:

● It can be used with any type of binary classifier.


● It is relatively easy to implement.
● It can be more accurate than other ensemble methods, such as bagging and boosting.

However, the ECOC method also has some disadvantages, including:

● It can be computationally expensive to train.


● The number of binary classifiers required can be large.
● The performance of the ECOC method can be sensitive to the design of the binary codewords.

Ensemble Fine-tuning Method:


Ensemble fine-tuning is a technique for improving the performance of an ensemble of models by fine-tuning the parameters of the individual models. This is done by retraining each model on a new training dataset that is created by combining the outputs of the individual models.

Ensemble fine-tuning can be an effective way to improve the performance of an ensemble, especially if the
individual models are not very well-tuned. However, it can also be time-consuming and computationally expensive.

The ensemble fine-tuning method is typically implemented by first training a set of individual models on the training
data. Then, for each data point in the training data, the outputs of all of the individual models are combined to create
a new training label. The individual models are then retrained on the new training dataset.

The ensemble fine-tuning method has several advantages, including:

● It can improve the performance of an ensemble of models.


● It can be used with any type of ensemble method.

However, the ensemble fine-tuning method also has some disadvantages, including:
● It can be time-consuming and computationally expensive.
● It can be difficult to choose the right hyperparameters for the fine-tuning process.

*********All the best****************
