AIML FML Answer Key
Year/Semester/Section: III/ V
Dept: AIML Max. Marks: 75
[Bloom’s Taxonomy: K1- Remember, K2- Understand, K3-Apply,
K4-Analysis]
PART – A
Answer all the questions (10*2 = 20)
1. How does the Perceptron work? (CO1,K1)
A Perceptron is a single-layer neural network and a linear binary classifier: a Machine Learning algorithm used for supervised
learning of binary classifiers. It works as an artificial neuron: it computes a weighted sum of the input features, adds a bias,
and passes the result through a step (activation) function to decide which of the two classes the input belongs to.
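A minimal sketch of the perceptron learning rule, assuming NumPy arrays X (features) and y (0/1 labels); the learning rate and number of epochs are illustrative assumptions:

import numpy as np

def perceptron_train(X, y, lr=0.1, epochs=10):
    w = np.zeros(X.shape[1])   # one weight per input feature
    b = 0.0                    # bias term
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            y_hat = 1 if np.dot(w, xi) + b > 0 else 0   # step activation
            w += lr * (yi - y_hat) * xi                 # update only on mistakes
            b += lr * (yi - y_hat)
    return w, b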
2. Illustrate how the linear regression model provides a sloped straight line representing the relationship between
the variables (CO1,K1)
Linear regression models the dependent variable y as a linear function of the independent variable x, y = a0 + a1·x, where a0 is
the intercept and a1 is the slope. The fitted regression line therefore slopes upward when the relationship between the variables
is positive and downward when it is negative.
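A brief scikit-learn sketch of fitting such a line; the sample data points are assumed purely for illustration:

import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([[1], [2], [3], [4], [5]])   # independent variable
y = np.array([3, 5, 7, 9, 11])            # dependent variable (here y = 1 + 2x)

model = LinearRegression().fit(x, y)
print(model.intercept_, model.coef_[0])   # a0 ≈ 1.0, a1 ≈ 2.0 (slope of the fitted line)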
9. What is the difference between bagging and boosting in machine learning? (CO5,K2)
Bagging trains models independently, in parallel, on bootstrap samples of the training data and combines their predictions by
voting or averaging, mainly reducing variance. Boosting trains models sequentially, with each new model focusing on the examples
that the previous models misclassified, mainly reducing bias.
10. List the purpose of cascading (CO5,K2)
Cascading arranges models in stages so that the output of one stage feeds the next, letting each stage specialize. For example, a
cascade neural network can be used to detect and isolate three types of faults; eight machine learning classifiers are compared in
the task of fault isolation, trained on real irradiance profiles with clouds, and fed with present and past information to
increase accuracy.
PART – B
Answer all the questions (5*11 = 55)
UNIT I
0. a) Explain the Classifier, Discriminant Function and Decision Surface in detail. (11)
Classifiers:
A classifier is a statistical function that assigns a data point to one of a predefined set of classes. Classifiers are used
in a wide variety of applications, including machine learning, pattern recognition, and image analysis.
There are many different types of classifiers, but they all share the same basic goal: to accurately predict the class
of a new data point. Some common types of classifiers include:
● Linear classifiers: These classifiers make predictions based on a linear combination of the input features.
● Nonlinear classifiers: These classifiers can make predictions on data that is not linearly separable.
● Probabilistic classifiers: These classifiers assign a probability to each class, indicating the likelihood that
the data point belongs to that class.
Discriminant Functions:
A discriminant function assigns a score to a data point for each class; the classifier predicts the class whose discriminant
function yields the highest score for that data point.
The discriminant function for a particular class is typically defined as the log of the posterior probability of the class
given the data point. This means that the discriminant function is high for data points that are likely to belong to the
class and low for data points that are unlikely to belong to the class.
Decision Surfaces:
A decision surface is a boundary in the input space that separates one class from another. The decision surface is
determined by the discriminant function.
Data points on one side of the decision surface are predicted to belong to one class, while data points on the other
side of the decision surface are predicted to belong to the other class.
The shape of the decision surface depends on the type of classifier and the input data. For example, the decision
surface for a linear classifier is a hyperplane, while the decision surface for a nonlinear classifier can be more
complex.
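A minimal sketch of a two-class linear discriminant and its decision surface; the weight vector and bias values are assumed purely for illustration:

import numpy as np

w = np.array([2.0, -1.0])   # assumed weight vector
b = -0.5                    # assumed bias

def g(x):
    return np.dot(w, x) + b          # linear discriminant function

def classify(x):
    return 1 if g(x) >= 0 else 0     # the decision surface is the hyperplane g(x) = 0

print(classify(np.array([1.0, 0.5])))   # g = 1.0 >= 0, so the point falls on the class-1 side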
0. b) Discuss the Bayesian Belief Network concept (11) (CO1, K4)
A Bayesian belief network is a key technology for dealing with probabilistic events and for solving problems that involve
uncertainty. We can define a Bayesian network as:
"A Bayesian network is a probabilistic graphical model which represents a set of variables and their conditional
dependencies using a directed acyclic graph."
It is also called a Bayes network, belief network, decision network, or Bayesian model.
Bayesian networks are probabilistic, because these networks are built from a probability distribution, and also use
probability theory for prediction and anomaly detection.
Real world applications are probabilistic in nature, and to represent the relationship between multiple events, we need a
Bayesian network. It can also be used in various tasks including prediction, anomaly detection, diagnostics,
automated insight, reasoning, time series prediction, and decision making under uncertainty.
A Bayesian Network can be used for building models from data and experts' opinions, and it consists of two parts:
o Directed Acyclic Graph
o Table of conditional probabilities
A Bayesian network graph is made up of nodes and Arcs (directed links), where:
o Each node corresponds to a random variable, and a variable can be continuous or discrete.
o Arcs or directed arrows represent the causal relationships or conditional probabilities between random variables.
These directed links or arrows connect pairs of nodes in the graph.
These links represent that one node directly influences the other node; if there is no directed link between two nodes,
they are independent of each other.
o In the above diagram, A, B, C, and D are random variables represented by the nodes of the network
graph.
o If we are considering node B, which is connected with node A by a directed arrow, then node A is called
the parent of Node B.
o Node C is independent of node A.
The Bayesian network graph does not contain any cycles. Hence, it is known as a directed acyclic graph or DAG.
The Bayesian network has mainly two components:
o Causal Component
o Actual numbers
Each node in the Bayesian network has a conditional probability distribution P(Xi | Parents(Xi)), which determines the effect
of the parents on that node.
Example: Harry has installed a new burglar alarm at his home. The alarm responds reliably to burglaries but also responds to
minor earthquakes. Harry's neighbours David and Sophia have agreed to call him at work when they hear the alarm.
Calculate the probability that the alarm has sounded, but neither a burglary nor an earthquake has occurred, and both
David and Sophia have called Harry.
Solution:
o The Bayesian network for the above problem is given below. The network structure shows that Burglary and Earthquake are
the parent nodes of Alarm and directly affect the probability of the alarm going off, while David's and Sophia's calls
depend on the alarm.
o The network also encodes our assumptions that David and Sophia do not perceive the burglary directly, do not notice
minor earthquakes, and do not confer with each other before calling.
o The conditional distribution for each node is given as a conditional probability table, or CPT.
o Each row in the CPT must sum to 1, because the entries in the row represent an exhaustive set of cases for the variable.
o In a CPT, a Boolean variable with k Boolean parents contains 2^k probabilities. Hence, if there are two parents, then the
CPT will contain 4 probability values.
We can write the events of the problem statement as the joint probability P[D, S, A, B, E], and rewrite it using the chain
rule together with the conditional independencies encoded in the network:
P[D, S, A, B, E] = P[D | A] · P[S | A] · P[A | B, E] · P[B] · P[E]
The observed prior probabilities for the Burglary and Earthquake components are P(B = True) = 0.002 and P(E = True) = 0.001.
The CPT for the Alarm node is:
B        E        P(A = True)    P(A = False)
True     True     0.94           0.06
True     False    0.95           0.05
False    True     0.31           0.69
False    False    0.001          0.999
The conditional probability that David calls depends on the probability of the Alarm:
A        P(D = True)    P(D = False)
True     0.91           0.09
False    0.05           0.95
The conditional probability that Sophia calls likewise depends on its parent node Alarm:
A        P(S = True)    P(S = False)
True     0.75           0.25
False    0.02           0.98
From the formula of the joint distribution, the required probability is
P(S, D, A, ¬B, ¬E) = P(S | A) · P(D | A) · P(A | ¬B, ¬E) · P(¬B) · P(¬E)
                   = 0.75 × 0.91 × 0.001 × 0.998 × 0.999
                   = 0.00068045.
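A short Python sketch reproducing the calculation above with the CPT values from the tables:

p_not_burglary   = 0.998   # P(¬B) = 1 - 0.002
p_not_earthquake = 0.999   # P(¬E) = 1 - 0.001
p_alarm          = 0.001   # P(A | ¬B, ¬E)
p_david          = 0.91    # P(D | A)
p_sophia         = 0.75    # P(S | A)

p = p_sophia * p_david * p_alarm * p_not_burglary * p_not_earthquake
print(round(p, 8))   # 0.00068045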
UNIT II
0. a) Illustrate the concept of Support Vector Machine with examples. (11) (CO2,K2)
Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms, which is used for
Classification as well as Regression problems. However, primarily, it is used for Classification problems in
Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-
dimensional space into classes so that we can easily put the new data point in the correct category in the future.
This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are
called support vectors, and hence the algorithm is termed Support Vector Machine. Consider the diagram below, in which
two different categories are classified using a decision boundary or hyperplane.
The SVM algorithm can be used for face detection, image classification, text categorization, etc.
Types of SVM
o Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset can be classified into
two classes by using a single straight line, then such data is termed linearly separable data, and the classifier
used is called the Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which means that if a dataset cannot be
classified by using a straight line, then such data is termed non-linear data, and the classifier used is called the
Non-linear SVM classifier.
Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in n-dimensional space, but we
need to find out the best decision boundary that helps to classify the data points. This best boundary is known as the
hyperplane of SVM.
The dimensionality of the hyperplane depends on the number of features in the dataset: if there are 2 features (as shown in
the image), the hyperplane is a straight line, and if there are 3 features, the hyperplane is a two-dimensional plane.
We always create the hyperplane with the maximum margin, i.e. the maximum distance between the hyperplane and the nearest
data points of either class.
Support Vectors:
The data points or vectors that are closest to the hyperplane and that affect its position are termed support vectors.
Since these vectors support the hyperplane, they are called support vectors.
Advantages of SVM:
o SVM works well in high-dimensional spaces, including when the number of features exceeds the number of samples.
o It is memory efficient, because the decision function uses only the support vectors.
o Different kernel functions make it versatile for both linear and non-linear classification problems.
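A minimal sketch of a linear SVM with scikit-learn; the synthetic blob data and the linear kernel are assumptions for illustration (an RBF kernel would give a non-linear SVM):

from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)   # two separable classes
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(kernel="linear")            # use kernel="rbf" for a non-linear SVM
clf.fit(X_train, y_train)
print(clf.support_vectors_.shape)     # the support vectors that define the hyperplane
print(clf.score(X_test, y_test))      # accuracy on held-out data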
14. (a) How is a random forest constructed? (11) (CO2, K2)
1. Data Preparation:
● Before constructing the random forest, it's essential to prepare the data appropriately. This includes handling
missing values, converting categorical features into numerical representations, and normalizing numerical
features if necessary.
2. Bootstrap Sampling:
● Random forest employs bootstrap sampling to introduce randomness and reduce overfitting. In bootstrap
sampling, a random subset of the training data is selected with replacement, meaning some data points may
appear multiple times while others may not appear at all.
3. Feature Subset Selection:
● For each bootstrap sample, a random subset of features is selected for building the decision tree. This process,
known as feature subset selection, helps prevent overfitting by reducing the number of features considered for
each tree.
4. Decision Tree Construction:
● Using the bootstrap sample and randomly selected features, a decision tree is constructed. The construction
process involves recursively splitting the data into smaller and smaller subsets based on the chosen features and
their values.
5. Splitting Criteria:
● At each split, a splitting criterion is used to determine the best feature and its value for separating the data.
Common splitting criteria include information gain, Gini index, and chi-squared test.
6. Pruning:
● To prevent overfitting, pruning is applied to the decision tree. Pruning involves removing nodes that contribute
little to the overall predictive performance of the tree.
7. Repetition:
● Steps 2-6 are repeated for a specified number of trees, typically hundreds or thousands. Each tree is constructed
independently using a different bootstrap sample and random feature subset.
8. Aggregation:
● After constructing all the decision trees, their predictions are aggregated to determine the final prediction for a
new data point. For classification tasks, majority voting is often used, where the class predicted by the majority
of trees becomes the final prediction. For regression tasks, averaging the predictions from all trees is common.
The resulting ensemble of decision trees, known as a random forest, is a powerful machine learning model that can
handle complex relationships and provide robust predictions.
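A minimal sketch of constructing a random forest with scikit-learn; the Iris dataset and the hyperparameter values are assumptions for illustration:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 100 trees, each grown on a bootstrap sample with a random subset of features considered at every split
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=42)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))   # aggregated (majority-vote) accuracy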
UNIT III
● Data Standardization: Normalize the data by centering each feature and scaling it to have a unit variance. This
ensures that all features are on the same scale and have equal influence on the analysis.
● Covariance Matrix Computation: Calculate the covariance matrix of the standardized data. The covariance
matrix represents the relationships between the different features.
● Eigenvalue Decomposition: Perform eigenvalue decomposition on the covariance matrix. This process
identifies the principal components, which are the directions of maximum variance in the data.
● Projection onto Principal Components: Project the data onto the selected principal components. This involves
selecting the k principal components with the largest eigenvalues and projecting the data onto these k directions.
The Iris flower dataset consists of measurements of the sepal length, sepal width, petal length, and petal width of
150 Iris flower specimens from three different species. PCA can be applied to this dataset to reduce the
dimensionality from four features to two principal components, which can then be used to visualize the distribution
of the three species in two-dimensional space.
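A minimal sketch of this Iris example with scikit-learn, following the steps listed above (standardization, then projection onto two principal components):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)   # centre each feature and scale to unit variance

pca = PCA(n_components=2)                   # keep the two directions of maximum variance
X_2d = pca.fit_transform(X_std)

print(X_2d.shape)                     # (150, 2)
print(pca.explained_variance_ratio_)  # share of the variance captured by each component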
1. Face Recognition: In face recognition systems, PCA can be used to reduce the dimensionality of high-
resolution facial images, making it computationally efficient to compare and identify faces. By projecting facial
images onto the principal components, the essential features of the faces can be extracted while discarding
redundant information.
2. Market Basket Analysis: In market basket analysis, PCA can be applied to a dataset of customer purchases to
identify patterns in buying behavior. By reducing the dimensionality of the purchase data, PCA can reveal the
underlying relationships between products and customer preferences.
These examples demonstrate the versatility of PCA in various domains, where dimensionality reduction is essential
for efficient data analysis and pattern recognition. PCA provides a powerful tool for extracting meaningful
information from high-dimensional data while reducing computational
complexity.
OR
0. a) Formulate the EM algorithm and explain how Gaussian Mixture Models work. (11) (CO3,K3)
The Expectation-Maximization (EM) algorithm is a statistical iterative algorithm used to estimate the parameters of
a statistical model. It is particularly useful for models that have latent variables, which are variables that are not
directly observed but are instead inferred from the observed data.
The EM algorithm works by iterating between two steps:
1. Expectation step: In the expectation step, the algorithm calculates the expected values of the latent
variables given the current parameter estimates.
2. Maximization step: In the maximization step, the algorithm maximizes the expected log-likelihood with respect to
the parameters, given the expected values of the latent variables.
The EM algorithm is guaranteed to converge to a local maximum of the likelihood function, but it may not find the
global maximum.
A Gaussian Mixture Model (GMM) represents the data as a mixture of several Gaussian distributions, each with its own mean,
covariance, and mixing weight; these parameters are typically estimated with the EM algorithm. GMMs can be used for the
following tasks:
● Clustering: GMMs can be used to cluster data points into a pre-specified number of clusters. Each cluster
is represented by a Gaussian distribution, and the data points are assigned to the cluster that corresponds to
the Gaussian distribution with the highest posterior probability.
● Density estimation: GMMs can be used to estimate the density of the data. This is useful for tasks such as
anomaly detection and outlier detection.
● Feature extraction: GMMs can be used to extract features from data. This is useful for tasks such as image
classification and speech recognition.
The EM algorithm and GMMs are powerful tools for statistical modeling. They are widely used in a variety of
applications and have proven to be effective in a wide range of tasks.
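A minimal sketch of fitting a GMM with scikit-learn, which runs the EM algorithm internally; the two synthetic clusters are assumptions for illustration:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (100, 2)),   # two illustrative Gaussian clusters
               rng.normal(5, 1, (100, 2))])

gmm = GaussianMixture(n_components=2, random_state=0)   # parameters estimated by EM
labels = gmm.fit_predict(X)                             # cluster with the highest posterior probability

print(gmm.means_)                # estimated cluster means
print(gmm.predict_proba(X[:3]))  # posterior probabilities computed in the E-step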
UNIT IV
0. a) (i) List the applications of Convolutional Neural Networks over Recurrent Neural Networks.
(ii) Discuss the time series prediction method in detail. (11) (CO4,K3)
OR
Process:
1. Image Preprocessing: The image is preprocessed to prepare it for analysis. This may involve resizing,
converting to grayscale, or normalizing the pixel values.
2. Feature Extraction: The model extracts features from the image, such as edges, shapes, and colors. These
features capture the essential visual information of the image.
3. Feature Representation: The extracted features are converted into a numerical representation that the model
can understand. This representation allows the model to learn relationships between the features and the
corresponding text descriptions.
4. Model Training: The model is trained on a dataset of images with corresponding text descriptions. During
training, the model learns to associate the extracted features with the correct text descriptions.
5. Image Classification: Once trained, the model can be used to classify new images. The model analyzes the
features of a new image and predicts the most likely text description based on the learned associations.
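A minimal sketch of such an image-classification model, assuming TensorFlow/Keras is available; the 64×64 input size and the 10 output classes are illustrative assumptions:

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(64, 64, 3)),  # feature extraction
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),                            # numerical feature representation
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),      # one output per class/description
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(images, labels, epochs=5) would then train on a labelled image dataset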
Applications:
● Image Search and Retrieval: Image to text classification enables efficient search engines that can find images
based on textual queries.
● Content Moderation: Identifying and filtering inappropriate or harmful content in images is crucial for social
media platforms and online communities.
● Accessibility: Image to text descriptions can make visual content accessible to people with visual impairments.
● Robotic Vision: Image to text classification plays a role in enabling robots to understand and interact with the
physical world.
UNIT V
Bagging, boosting, and cascading are all ensemble learning techniques that aim to improve the performance of
machine learning models by combining multiple models. These techniques work by generating multiple models,
each trained on a different subset of the data, and then aggregating the predictions of these models to make a final
prediction.
Bagging:
Bagging (bootstrap aggregating) trains each model independently, and in parallel, on a different bootstrap sample of the
training data, then aggregates their predictions by voting (for classification) or averaging (for regression). Because the
models' errors are partly independent, bagging mainly reduces the variance of the final model.
Boosting:
Boosting is a more sophisticated ensemble method that builds models sequentially, focusing on the data points that
were misclassified by previous models. Each model in the sequence is trained on a weighted version of the training
data, where the weights are assigned based on the errors of the previous model. This means that the model pays
more attention to the data points that it is struggling with, which helps to improve the model's overall performance.
Boosting is particularly effective in reducing the bias of the model, which is a measure of how much the model's
predictions are shifted away from the true values. By focusing on the data points that were misclassified by previous
models, boosting can help to correct for the biases of these models and make the overall model more accurate.
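A minimal sketch of boosting with scikit-learn's AdaBoost, whose default weak learner is a depth-1 decision tree; the breast-cancer dataset is an assumption for illustration:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# each new weak learner is fitted to a reweighted dataset that emphasises previously misclassified points
boost = AdaBoostClassifier(n_estimators=50, random_state=0)
boost.fit(X_train, y_train)
print(boost.score(X_test, y_test))   # accuracy of the sequentially built ensemble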
Cascading:
Cascading is a less common ensemble method that combines multiple models in a hierarchical manner. The models
are arranged in a cascade, where the output of one model is used as the input for the next model. This allows the
models to specialize in different aspects of the problem, which can improve the overall performance of the ensemble.
Cascading is particularly effective in problems where there are multiple stages of classification or regression. For
example, a cascading model could be used to first classify an image as a cat or a dog, and then use this classification
as input to a second model that classifies the cat or dog as a specific breed.
OR
20. Explain the concept of the Error-Correcting Output Code method and the ensemble fine-tuning method. (11) (CO5,K2)
Error-Correcting Output Code Method (ECOC):
Error-Correcting Output Codes (ECOC) is an ensemble learning technique that transforms a multi-class
classification problem into multiple binary classification problems. This allows native binary classification models
to be used directly for multi-class classification tasks.
ECOC works by encoding each class as a binary codeword. The binary codewords are designed to have error-
correcting properties, meaning that if a few of the binary classifiers make mistakes, the overall prediction will still
be correct.
The ECOC method is typically implemented by first training a set of binary classifiers on the training data. Then,
for each data point to be classified, the outputs of all of the binary classifiers are combined to determine the predicted
class.
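A minimal sketch of ECOC with scikit-learn's OutputCodeClassifier; the logistic-regression base estimator and the code_size value are assumptions for illustration:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OutputCodeClassifier

X, y = load_iris(return_X_y=True)   # a 3-class problem

# each class is encoded as a binary codeword; code_size controls the codeword length relative to the number of classes
ecoc = OutputCodeClassifier(LogisticRegression(max_iter=1000), code_size=2, random_state=0)
ecoc.fit(X, y)
print(ecoc.predict(X[:5]))          # predictions decoded from the outputs of the binary classifiers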
Ensemble Fine-Tuning Method:
Ensemble fine-tuning can be an effective way to improve the performance of an ensemble, especially if the
individual models are not very well tuned. However, it can also be time-consuming and computationally expensive.
The ensemble fine-tuning method is typically implemented by first training a set of individual models on the training
data. Then, for each data point in the training data, the outputs of all of the individual models are combined to create
a new training label. The individual models are then retrained on the new training dataset.
However, the ensemble fine-tuning method also has some disadvantages, including:
● It can be time-consuming and computationally expensive.
● It can be difficult to choose the right hyperparameters for the fine-tuning process.