
Unit 3: Classification & Regression: Question Bank and Its Solution

1. The document provides a question bank and solutions for the Unit 3 topics of Classification & Regression in an Artificial Intelligence & Machine Learning course. 2. The unit covers decision trees, random forests, naive Bayes, support vector machines, logistic regression, K-means, and K-nearest neighbors algorithms. 3. The question bank includes theory questions, mathematics questions, and numerical questions ranging from 2 to 6 marks on the covered classification and regression techniques.


Artificial Intelligence & Machine Learning

Course Code: 302049

Unit 3: Classification & Regression


Third Year Bachelor of Engineering (Choice Based Credit System)
Mechanical Engineering (2019 Course)
Board of Studies – Mechanical and Automobile Engineering, SPPU, Pune
(With Effect from Academic Year 2021-22)

Question bank and its solution


by

Abhishek D. Patange, Ph.D.


Department of Mechanical Engineering
College of Engineering Pune (COEP)
QUESTION BANK FOR UNIT 3: CLASSIFICATION & REGRESSION

Unit 3: Classification & Regression


Syllabus:
Content Theory Mathematics Numerical
• Classification & Regression
Decision tree (C & R)
Random forest (C & R)
Naive Bayes (C)
Support vector machine (C & R)
Logistic Regression (R)
K-Means,
K-Nearest Neighbor (KNN)
• Applications of classification and regression algorithms in Mechanical
Engineering
Note: ‘C’ stands for classification and ‘R’ stands for regression

Type of question and marks:


Type     Theory       Mathematics   Numerical
Marks    2, 4 or 6    4             2 or 4

Dr. Abhishek D. Patange, Mechanical Engineering, College of Engineering Pune (COEP)



Theory Mathematics Numerical


Topic: Decision trees

Theory questions

1. Why use decision trees?

Machine learning offers many algorithms, so choosing the one best suited to the given
dataset and problem is the main consideration when building a model. The key properties
of a decision tree are:
• A decision tree is a supervised learning technique that can be used for both
classification and regression problems, although it is mostly preferred for
classification.
• It is a tree-structured classifier in which internal nodes represent the features of a
dataset, branches represent the decision rules, and each leaf node represents an
outcome.
• A decision tree has two kinds of node: decision nodes and leaf nodes. Decision nodes
test a feature and have multiple branches, whereas leaf nodes hold the output of those
decisions and have no further branches.
• The decisions (tests) are performed on the features of the given dataset.
• It is a graphical representation of all the possible solutions to a problem or decision
under the given conditions.
• It is called a decision tree because, like a tree, it starts at a root node and expands
along branches into a tree-like structure.
• At each node the tree asks a question and, based on the answer (for example Yes/No),
splits further into sub-trees. The diagram below explains the general structure of a
decision tree.

The two main reasons for using a decision tree are:
• Decision trees mimic the way humans reason when making a decision, so they are easy
to understand.
• The logic behind a decision tree is easy to follow because it is laid out as a tree-like
structure.

2. Explain decision tree terminology.

A decision tree comprises a root node, leaf nodes, branches, parent/child nodes, and so on.
The terminology is explained below.
• Root node: The node where the decision tree starts. It represents the entire
dataset, which is then divided into two or more homogeneous sets.
• Leaf node: A final output node; the tree cannot be split further once a leaf node is
reached.
• Splitting: The process of dividing a decision node or the root node into sub-nodes
according to the given conditions.
• Branch/Sub-tree: A tree formed by splitting the tree.
• Pruning: The process of removing unwanted branches from the tree.
• Parent/Child node: A node that is split into sub-nodes is the parent of those
sub-nodes, and the sub-nodes are its children. The root node is a parent that has no
parent of its own.

3. How does the Decision Tree algorithm Work for classification?

To predict the class of a given record, the algorithm starts at the root node of the tree.
It compares the value of the root attribute with the corresponding attribute of the record
(real dataset) and, based on the comparison, follows a branch to the next node. At that
node it again compares the attribute value and moves further down. The process continues
until a leaf node of the tree is reached. The tree itself is built as follows:
• Step 1: Begin the tree with the root node, say S, which contains the complete dataset.
• Step 2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM),
i.e. information gain or the Gini index.
• Step 3: Divide S into subsets, one for each possible value of the best attribute.
• Step 4: Generate the decision tree node that contains the best attribute.
• Step 5: Recursively build new decision trees from the subsets created in Step 3.
Continue until a stage is reached where the nodes cannot be split any further; each such
final node is called a leaf node.


Example: Suppose a candidate has a job offer and wants to decide whether to accept it.
To solve this problem, the decision tree starts with the root node (the Salary attribute,
chosen by the ASM).

The root node splits into the next decision node (distance from the office) and one
leaf node, based on the corresponding labels. That decision node in turn splits into
one decision node (cab facility) and one leaf node. Finally, the last decision node splits
into two leaf nodes (offer accepted and offer declined). See the figure.
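
The five steps and the job-offer example can be sketched in a few lines of Python. This is a minimal illustration, assuming information gain as the attribute selection measure; the seven records in `offers` and the helper names (`entropy`, `information_gain`, `build_tree`, `predict`) are invented here for illustration and do not come from the question bank.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return sum((n / total) * math.log2(total / n) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the node minus the weighted entropy of its subsets."""
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy([row[target] for row in rows]) - remainder

def build_tree(rows, attributes, target):
    """Return a leaf label, or {attribute: {value: subtree}} for a decision node."""
    labels = [row[target] for row in rows]
    if len(set(labels)) == 1 or not attributes:       # cannot split further: leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(rows, a, target))  # Step 2
    remaining = [a for a in attributes if a != best]
    branches = {}
    for value in {row[best] for row in rows}:          # Step 3: one subset per value
        subset = [row for row in rows if row[best] == value]
        branches[value] = build_tree(subset, remaining, target)  # Step 5: recurse
    return {best: branches}                            # Step 4: decision node

def predict(tree, record):
    """Follow branches from the root until a leaf label is reached."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))
        tree = tree[attribute][record[attribute]]
    return tree

# Hypothetical job offers (invented for illustration only).
offers = [
    {"salary": "low",  "distance": "near", "cab": "yes", "offer": "decline"},
    {"salary": "low",  "distance": "far",  "cab": "no",  "offer": "decline"},
    {"salary": "high", "distance": "near", "cab": "no",  "offer": "accept"},
    {"salary": "high", "distance": "near", "cab": "yes", "offer": "accept"},
    {"salary": "high", "distance": "far",  "cab": "yes", "offer": "accept"},
    {"salary": "high", "distance": "far",  "cab": "no",  "offer": "decline"},
    {"salary": "high", "distance": "near", "cab": "no",  "offer": "accept"},
]
tree = build_tree(offers, ["salary", "distance", "cab"], "offer")
print(tree)
print(predict(tree, {"salary": "high", "distance": "far", "cab": "no"}))
```

With these records the sketch places salary at the root, then distance, then cab facility, which mirrors the structure described in the example. A different set of records could of course produce a different tree.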

4. How does the Decision Tree algorithm Work for regression?

The general idea is to segment the predictor space into a number of simple regions. To
make a prediction for a given observation, we typically use the mean of the training
data in the region to which it belongs. Because the set of splitting rules used to
segment the predictor space can be summarized by a tree, such approaches are called
decision tree methods. These methods are simple and useful for interpretation.

We want to predict a response or class Y from inputs X1, X2, ..., Xp. We do this by
growing a binary tree. At each internal node in the tree, we apply a test to one of the
inputs, say Xi. Depending on the outcome of the test, we go to either the left or the right
sub-branch of the tree. Eventually we come to a leaf node, where we make a prediction.
This prediction aggregates or averages all the training data points that reach that leaf.

To motivate regression trees, we begin with a simple example: predicting a baseball
player's Salary from Years (the number of years that he has played in the major leagues)
and Hits (the number of hits that he made in the previous year). We first remove
observations with missing Salary values and log-transform Salary so that its distribution
has more of a typical bell shape. Recall that Salary is measured in thousands of dollars.

The tree represents a series of splits starting at the top of the tree. The top split assigns
observations having Years < 4.5 to the left branch. The predicted salary for these players is
given by the mean response value for the players in the data set with Years < 4.5. For such
players, the mean log salary is 5.107, so we predict a salary of e^5.107 thousand
dollars, i.e. about $165,174. How would you interpret the rest (right branch) of the tree?


• In keeping with the tree analogy, the regions R1, R2, and R3 are known as terminal
nodes or leaves of the tree.
• As is the case for Figure 2, decision trees are typically drawn upside down, in the sense
that the leaves are at the bottom of the tree.
• The points along the tree where the predictor space is split are referred to as internal
nodes.
• In Figure 2, the two internal nodes are indicated by the text Years < 4.5 and Hits < 117.5.
• We refer to the segments of the tree that connect the nodes as branches.
• Years is the most important factor in determining Salary, and players with less experience
earn lower salaries than more experienced players.
• Given that a player is less experienced, the number of hits that he made in the previous
year seems to play little role in his salary.
• But among players who have been in the major leagues for five or more years, the
number of hits made in the previous year does affect salary, and players who made more
hits last year tend to have higher salaries.
• The regression tree shown in Figure 2 is likely an over-simplification of the true
relationship between Hits, Years, and Salary, but it is much easier to interpret than
more complicated approaches.
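
How a single split such as Years < 4.5 is chosen, and how a leaf turns into a prediction, can be sketched as follows. This is a minimal sketch that assumes the usual least-squares criterion (pick the threshold that minimises the total squared error of the two leaf means). The `years` and `log_salary` values are made up for illustration and are not the baseball data; only the 5.107 figure comes from the text.

```python
import math

def mean(values):
    return sum(values) / len(values)

def sse(values):
    """Sum of squared deviations from the mean (error of predicting the mean)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def best_split(xs, ys):
    """Return (threshold, left_mean, right_mean) minimising total squared error."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                      # no threshold fits between equal x values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            best = (cost, threshold, mean(left), mean(right))
    return best[1:]

# Hypothetical data: years in the league and log-salary (salary in $1000s).
years = [1, 2, 3, 4, 5, 6, 8, 10]
log_salary = [4.8, 5.0, 5.2, 5.4, 6.2, 6.4, 6.5, 6.7]

threshold, left_mean, right_mean = best_split(years, log_salary)
print(f"split at Years < {threshold}: leaf means {left_mean:.2f} and {right_mean:.2f}")

# Converting the leaf value quoted in the text back to dollars.
print(f"e^5.107 thousand dollars = ${math.exp(5.107) * 1000:,.0f}")
```

A full regression tree repeats `best_split` inside each resulting region, over every predictor, until a stopping rule is met.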


Mathematics based questions

5. Explain entropy reduction, information gain and Gini index in decision tree.

When building a decision tree, the main issue is how to select the best attribute for the
root node and for the sub-nodes. This is solved with a technique called the Attribute
Selection Measure (ASM), which lets us select the best attribute for each node of the
tree. There are two popular ASM techniques: information gain and the Gini index.
Information Gain:
• Information gain measures the change in entropy after a dataset is segmented on an
attribute.
• It quantifies how much information a feature provides about a class.
• Nodes are split according to the value of information gain. A decision tree algorithm
always tries to maximize it, so the node/attribute with the highest information gain is
split first. It is calculated as:
$\text{Information Gain}(S, A) = \text{Entropy}(S) - \sum_{v} \frac{|S_v|}{|S|}\,\text{Entropy}(S_v)$
where S is the set of samples at the node, A is the attribute, and S_v is the subset of S in
which A takes the value v. The sum is the weighted average entropy of the subsets.
Entropy:
Entropy is a metric that measures the impurity of a set of samples; it specifies the
randomness in the data. For a binary (yes/no) class it is calculated as:
$\text{Entropy}(S) = -P(\text{yes})\log_2 P(\text{yes}) - P(\text{no})\log_2 P(\text{no})$
where S is the set of samples, P(yes) is the probability of yes, and P(no) is the probability of no.
Gini Index:
• The Gini index is a measure of impurity (or purity) used when building a decision tree
with the CART (Classification and Regression Tree) algorithm.
• An attribute with a low Gini index is preferred over one with a high Gini index. It is
calculated as $\text{Gini Index} = 1 - \sum_j P_j^2$, where P_j is the proportion of samples in class j.
• The CART algorithm uses the Gini index to create splits, and it creates only binary
splits.
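
As a worked illustration of these measures, consider a hypothetical node S with 5 "yes" and 5 "no" samples that an attribute splits into S_1 (4 yes, 1 no) and S_2 (1 yes, 4 no). The counts are invented for this example.

```latex
\begin{align*}
\text{Entropy}(S)   &= -\tfrac{5}{10}\log_2\tfrac{5}{10} - \tfrac{5}{10}\log_2\tfrac{5}{10} = 1 \\
\text{Entropy}(S_1) &= -\tfrac{4}{5}\log_2\tfrac{4}{5} - \tfrac{1}{5}\log_2\tfrac{1}{5} \approx 0.722 \\
\text{Entropy}(S_2) &= -\tfrac{1}{5}\log_2\tfrac{1}{5} - \tfrac{4}{5}\log_2\tfrac{4}{5} \approx 0.722 \\
\text{Information Gain} &= 1 - \left(\tfrac{5}{10}\cdot 0.722 + \tfrac{5}{10}\cdot 0.722\right) \approx 0.278 \\
\text{Gini}(S_1)    &= 1 - \left(0.8^2 + 0.2^2\right) = 0.32
\end{align*}
```

The split lowers the entropy from 1 to about 0.722, so it gains about 0.278 bits; an attribute that produced purer subsets would gain more and would be preferred.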

6. What are advantages and limitations of the decision trees?

Advantages of the Decision Tree


• It is simple to understand because it follows the same process that a human follows when
making a decision in real life.
• It can be very useful for solving decision-related problems.
• It helps to think through all the possible outcomes of a problem.
• It requires less data cleaning than many other algorithms.

Disadvantages of the Decision Tree


• A decision tree can contain many layers, which makes it complex.
• It may overfit, an issue that can be addressed with the Random Forest
algorithm.
• With more class labels, the computational complexity of the decision tree may
increase.

7. A decision tree often tends to overfit during training. What is the reason behind it and how can it be avoided?

A decision tree tends to overfit because at each node it makes a decision on a subset of
the features (columns), so by the time it reaches a final decision it has built a long and
complicated decision chain. The final decision applies only to a data point that satisfies
every rule along this chain. Such rules are very specific to the training set and do not
generalize well to new data points that the tree has never seen. The tendency is stronger
when the dataset has many features (high dimension).

In the J48 decision tree, overfitting happens when the algorithm receives data with
exceptional attribute values. This causes many fragmentations in the process distribution,
where fragmentations are statistically unimportant nodes supported by very few examples.
The J48 algorithm usually grows its branches "just deep enough to perfectly classify the
training examples". This approach performs well with noise-free data, but most of the time
it overfits training examples that contain noise.

Two strategies are widely used to avoid overfitting in decision tree learning:
1) Stop the tree from growing before it reaches the point where it perfectly classifies
the training data.
2) Let the tree overfit the training data, then post-prune it.

By default, a decision tree model is allowed to grow to its full depth. Pruning refers to
techniques that remove parts of the decision tree or prevent it from growing to its full
depth. By tuning the hyperparameters of the decision tree model one can prune the tree and
prevent overfitting. There are two types of pruning: pre-pruning and post-pruning.
Pre-Pruning:
Pre-pruning refers to stopping the growth of the decision tree early. It involves tuning
the hyperparameters of the decision tree model before training. Hyperparameters such as
max_depth, min_samples_leaf, and min_samples_split can be tuned to stop the growth of
the tree early and prevent the model from overfitting.
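
The effect of pre-pruning can be seen on a toy tree grower. The sketch below is a hypothetical pure-Python implementation for one numeric feature, not the implementation of any library; its `max_depth` and `min_samples_leaf` arguments only mimic the meaning of the hyperparameters named above, and the noisy `readings` are invented.

```python
from collections import Counter

def gini(labels):
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def grow(points, depth=0, max_depth=None, min_samples_leaf=1):
    """Grow a tree from (x, label) pairs.
    Returns a label (leaf) or a tuple (threshold, left_subtree, right_subtree)."""
    labels = [label for _, label in points]
    majority = Counter(labels).most_common(1)[0][0]
    # Stop when the node is pure or the depth limit (pre-pruning) is reached.
    if len(set(labels)) == 1 or (max_depth is not None and depth >= max_depth):
        return majority
    points = sorted(points)
    best = None
    # Only consider splits that leave at least min_samples_leaf points on each side.
    for i in range(min_samples_leaf, len(points) - min_samples_leaf + 1):
        if points[i - 1][0] == points[i][0]:
            continue
        left, right = points[:i], points[i:]
        cost = (len(left) * gini([l for _, l in left])
                + len(right) * gini([l for _, l in right]))
        if best is None or cost < best[0]:
            best = (cost, (points[i - 1][0] + points[i][0]) / 2, left, right)
    if best is None:                      # no admissible split: make a leaf
        return majority
    _, threshold, left, right = best
    return (threshold,
            grow(left, depth + 1, max_depth, min_samples_leaf),
            grow(right, depth + 1, max_depth, min_samples_leaf))

def count_leaves(tree):
    if not isinstance(tree, tuple):
        return 1
    return count_leaves(tree[1]) + count_leaves(tree[2])

# Hypothetical noisy readings: mostly "ok" at low x, mostly "fault" at high x.
readings = [(1, "ok"), (2, "ok"), (3, "ok"), (4, "fault"), (5, "ok"),
            (6, "fault"), (7, "fault"), (8, "ok"), (9, "fault"), (10, "fault")]

full = grow(readings)                           # grows until every leaf is pure
shallow = grow(readings, max_depth=1)           # pre-pruned by depth
coarse = grow(readings, min_samples_leaf=3)     # pre-pruned by leaf size
print(count_leaves(full), count_leaves(shallow), count_leaves(coarse))
```

The unrestricted tree needs a separate leaf for every noisy run of labels, while the pre-pruned trees keep only the broad low/high pattern, which is the behaviour that tends to generalize better.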


Post-Pruning:
Post-pruning allows the decision tree model to grow to its full depth and then removes
branches to prevent the model from overfitting. Cost complexity pruning (ccp) is one
post-pruning technique; its ccp_alpha hyperparameter can be tuned to obtain the
best-fitting model.

Problems/Numerical

8. Problems on calculating entropy and information gain

Problem 1:
If we arbitrarily label all 4 gumballs as red, how often would a gumball be incorrectly
labelled?
4 red and 0 blue:
$\text{Gini} = 1 - (1^2 + 0^2) = 0$
The impurity measurement is 0 because we would never incorrectly label any of the 4 red
gumballs here. If we arbitrarily chose to label all the balls "blue" instead, the index
would still be 0, because the Gini score depends only on the class proportions and not on
which class we choose to name.
A Gini score of 0 is the purest score possible.
2 red and 2 blue:
$\text{Gini} = 1 - (0.5^2 + 0.5^2) = 0.5$
The impurity measurement is 0.5 because we would label gumballs incorrectly about
half the time. For a binary target variable (0, 1), a Gini index of 0.5 is the least pure
score possible: half is one type and half is the other. Dividing a Gini score by 0.5 helps
to interpret it intuitively: 0.5/0.5 = 1 means the grouping is as impure as possible
(for a group with just 2 outcomes).
3 red and 1 blue:
$\text{Gini} = 1 - (0.75^2 + 0.25^2) = 0.375$
The impurity measurement here is 0.375. Dividing this by 0.5 for a more intuitive reading
gives 0.75, i.e. the group is at 75% of the maximum impurity for two classes.
Problem 2:
How does entropy work with the same gumball scenarios stated in Problem 1?
4 red and 0 blue:
$\text{Entropy} = -(1 \cdot \log_2 1) = 0$ (taking $0 \cdot \log_2 0 = 0$)
Unsurprisingly, the impurity measurement is 0 for entropy as well. This is the maximum
purity score using information entropy.
2 red and 2 blue:
$\text{Entropy} = -(0.5\log_2 0.5 + 0.5\log_2 0.5) = 1$
The impurity measurement is 1 here, the maximum impurity obtainable.
3 red and 1 blue:
$\text{Entropy} = -(0.75\log_2 0.75 + 0.25\log_2 0.25) \approx 0.811$
The impurity measurement is 0.811 here, a bit worse (more impure) than the normalized
Gini score of 0.75.
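
The gumball figures in Problems 1 and 2 can be checked with a short script. This is a minimal sketch: the function names `gini` and `entropy` are chosen here for illustration, and each function takes the per-class counts of a group.

```python
import math

def gini(counts):
    """Gini index 1 - sum(p_j^2) from per-class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    """Entropy sum(p_j * log2(1 / p_j)) from per-class counts; empty classes add 0."""
    total = sum(counts)
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)

for red, blue in [(4, 0), (2, 2), (3, 1)]:
    g = gini([red, blue])
    h = entropy([red, blue])
    print(f"{red} red, {blue} blue: gini={g:.3f} (normalized {g / 0.5:.2f}), entropy={h:.3f}")
```

Both measures are 0 for a pure group and reach their maximum for a 50/50 group; they differ only in how they grade the mixtures in between.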
Problem 3:
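Calculate the entropy for the following example.
For the set X = {a,a,a,b,b,b,b,b}
Total instances: 8
Instances of b: 5
Instances of a: 3
$\text{Entropy}(X) = -\left[\tfrac{3}{8}\log_2\tfrac{3}{8} + \tfrac{5}{8}\log_2\tfrac{5}{8}\right]$
$= -[0.375 \times (-1.415) + 0.625 \times (-0.678)]$
$= -(-0.530 - 0.424)$
$= 0.954$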
Problem 4:
In the mini-dataset below, the label we are trying to predict is the type of fruit, based
on the size, color, and shape variables.

Calculate the information gained if we select the color variable.
3 out of the 6 records are yellow, 2 are green, and 1 is red. Proportionally, the probability of
a yellow fruit is 3/6 = 0.5, of a green fruit 2/6 = 0.333, and of a red fruit 1/6 = 0.167.
Using the formula from above, we can calculate it like this:
Information gain = - ([3/6 * log2(3/6)] + [2/6 * log2(2/6)] + [1/6 * log2(1/6)]) = 1.459148
Calculate the information gained if we select the size variable.
In this case, 3/6 of the fruits are medium-sized, 2/6 are small, and 1/6 is big.
Information gain = - ([3/6 * log2(3/6)] + [2/6 * log2(2/6)] + [1/6 * log2(1/6)]) = 1.459148
Calculate the information gained if we select the shape variable.
Here, 5/6 of the fruits are round and 1/6 is thin.
Information gain = - ([5/6 * log2(5/6)] + [1/6 * log2(1/6)]) = 0.650022
Note: each value above is the entropy of the attribute's own value distribution. The
information gain with respect to the fruit label, as defined in Question 5, would also
need the label counts within each attribute value.
Problem 5:
Consider the following data set for a binary class problem.


Problem 6:
Consider the training examples shown in Table below for a binary classification problem.


Problem 7:
Consider the training examples shown in Table below for a binary classification problem.



Topic: Random forest tree

Theory questions

9. Why use random forest trees?

Random Forest is a popular machine learning algorithm that belongs to the supervised
learning technique. It can be used for both classification and regression problems in ML. It
is based on the concept of ensemble learning, which is the process of combining multiple
classifiers to solve a complex problem and improve the performance of the model.
As the name suggests, "Random Forest is a classifier that contains a number of decision
trees on various subsets of the given dataset and takes the average to improve the
predictive accuracy of that dataset." Instead of relying on one decision tree, the random
forest takes the prediction from each tree and predicts the final output based on the
majority vote of those predictions. A greater number of trees in the forest generally
leads to higher accuracy and reduces the problem of overfitting.
Assumptions for Random Forest
Since the random forest combines multiple trees to predict the class of the dataset, it is
possible that some decision trees predict the correct output while others do not. Taken
together, however, the trees are expected to predict the correct output. A good Random
Forest classifier therefore rests on two assumptions:
• The feature variables of the dataset should contain some actual values so that the
classifier can predict accurate results rather than guessed results.
• The predictions from the individual trees must have very low correlations.
The diagram below explains the working of the Random Forest algorithm:

The following points explain why we should use the Random Forest algorithm:
• It can take less training time than some other algorithms.
• It predicts output with high accuracy and runs efficiently even on large datasets.
• It can maintain accuracy when a large proportion of the data is missing.
• It can be used for both classification and regression tasks.
• Overfitting is a serious problem that can make results poor, but a random forest is
much less prone to overfitting when it has enough trees.
• It can be used for categorical values as well.

10. How does the random forest tree work for classification?

Random Forest works in two phases: the first creates the random forest by combining N
decision trees, and the second makes predictions with each tree created in the first phase.
The working process can be explained in the steps and diagram below:
Step 1: Select K random data points from the training set.
Step 2: Build the decision tree associated with the selected data points (subset).
Step 3: Choose the number N of decision trees that you want to build.
Step 4: Repeat Steps 1 and 2 until N trees have been built.
Step 5: For a new data point, find the prediction of each decision tree, and assign the new
data point to the category that wins the majority vote.
The working of the algorithm can be better understood through the example below:
Example: Suppose there is a dataset that contains multiple fruit images, and this dataset is
given to the Random Forest classifier. The dataset is divided into subsets, and each subset
is given to a decision tree. During the training phase, each decision tree produces a
prediction result; when a new data point arrives, the Random Forest classifier predicts
the final decision based on the majority of those results. Consider the image below:
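
Steps 1 to 5 can be sketched with very small "trees". This is a simplified illustration, not a full random forest: each tree is a one-split stump on a single numeric feature, the K points are assumed to be drawn with replacement, and the fruit-size `data`, the seed, and the function names are all invented for the example.

```python
import random
from collections import Counter

def gini(labels):
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def train_stump(sample):
    """Fit a one-split tree to (x, label) pairs and return a predict function."""
    sample = sorted(sample)
    labels = [label for _, label in sample]
    majority = Counter(labels).most_common(1)[0][0]
    best = None
    for i in range(1, len(sample)):
        if sample[i - 1][0] == sample[i][0]:
            continue
        left, right = labels[:i], labels[i:]
        cost = len(left) * gini(left) + len(right) * gini(right)
        if best is None or cost < best[0]:
            best = (cost, (sample[i - 1][0] + sample[i][0]) / 2,
                    Counter(left).most_common(1)[0][0],
                    Counter(right).most_common(1)[0][0])
    if best is None:                      # all x values identical: constant predictor
        return lambda x: majority
    _, threshold, left_label, right_label = best
    return lambda x: left_label if x < threshold else right_label

def train_forest(data, n_trees, k, rng):
    forest = []
    for _ in range(n_trees):                             # Steps 3-4: build N trees
        sample = [rng.choice(data) for _ in range(k)]    # Step 1: K random points
        forest.append(train_stump(sample))               # Step 2: one tree per subset
    return forest

def predict(forest, x):
    votes = Counter(tree(x) for tree in forest)          # Step 5: majority vote
    return votes.most_common(1)[0][0]

# Hypothetical fruit sizes in cm: small fruits vs large fruits.
data = [(x, "small") for x in range(1, 6)] + [(x, "large") for x in range(11, 16)]
forest = train_forest(data, n_trees=25, k=10, rng=random.Random(0))
print(predict(forest, 3), predict(forest, 13))
```

An individual stump trained on an unlucky sample can be wrong, but the vote across many stumps is what determines the final class.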

11. Explain random forest tree terminology.

• Bagging: Given a training set of N examples, we repeatedly sample subsets of the
training data of size n, where n is less than N. Sampling is done at random but with
replacement. This subsampling of the training set, followed by aggregation of the
resulting estimators, is called bootstrap aggregating, or bagging for short.
• Random subspace method: If each training example has M features, we take a subset of
them of size m < M to train each estimator. So no estimator sees the full training set;
each estimator sees only m features of n training examples.
• Training estimators: We create Ntree decision trees, or estimators, and train each one on
a different set of m features and n training examples. Unlike a simple decision tree
classifier, the trees are not pruned.
• Perform inference by aggregating predictions of estimators: To make a prediction
for a new incoming example, we pass the relevant features of this example to each of the
Ntree estimators. We obtain Ntree predictions, which we combine to produce
the overall prediction of the random forest. For classification we use
majority voting to decide on the predicted class, and for regression we
take the mean of the predictions of all the estimators.

Random forest inference for a simple classification example with Ntree = 3
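
The random subspace draw and the two aggregation rules can be written down directly. The sketch below is illustrative only: the function names, the label strings, and the numbers are assumptions made for the example, with Ntree = 3 predictions as in the figure caption.

```python
import random
from collections import Counter
from statistics import mean

def feature_subset(n_features, m, rng):
    """Random subspace method: pick m of the M feature indices without repetition."""
    return sorted(rng.sample(range(n_features), m))

def aggregate_classification(predictions):
    """Majority vote over the class labels returned by the estimators."""
    return Counter(predictions).most_common(1)[0][0]

def aggregate_regression(predictions):
    """Mean of the numeric predictions returned by the estimators."""
    return mean(predictions)

rng = random.Random(0)
print(feature_subset(n_features=6, m=2, rng=rng))             # features seen by one estimator
print(aggregate_classification(["faulty", "healthy", "faulty"]))
print(aggregate_regression([2.0, 3.0, 4.0]))
```

With three estimators, two matching votes are enough to fix the class, which is why an odd number of trees avoids ties in binary classification.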


This use of many estimators is the reason why the random forest algorithm is called an
ensemble method. Each individual estimator is a weak learner, but when many weak
estimators are combined together they can produce a much stronger learner. Ensemble
methods take a 'strength in numbers' approach, where the output of many small models is
combined to produce a much more accurate and powerful prediction.

12. What are advantages and limitations of the random forest tree?

Advantages of Random Forest


• Random Forest is capable of performing both classification and regression tasks.
• It is capable of handling large datasets with high dimensionality.
• It enhances the accuracy of the model and reduces the overfitting issue.
• It is fast and can also deal with data that contains missing values.
• Using random forest you can compute the relative feature importance.
• It can give good accuracy even when a large share of the data is missing.
Limitations of Random Forest
• Although random forest can be used for both classification and regression tasks, it is
less suitable for regression tasks.
• Random forest is a complex algorithm that is not easy to interpret.
• Its complexity is high.
• Predictions from a random forest take much longer than those of many other
algorithms.
• More computational resources are required to use a random forest algorithm.

13. What is the difference between simple decision tree and random forest tree?

SN | Random Forest | Decision Tree
1. | While building a random forest, the rows used for each tree are selected randomly. | It builds a single decision tree and finds the output from it.
2. | It combines two or more decision trees. | A decision tree is a single model built from a collection of variables (attributes) of the data set.
3. | It gives accurate results. | It gives less accurate results.
4. | By using multiple trees it reduces the chances of overfitting. | A decision tree is prone to overfitting, an error that arises mainly from high variance.
5. | Random forest is more complicated to interpret. | The decision tree is simple, so it is easy to read and understand.
6. | We need to generate, process, and analyze many trees, so the process is slow; it may take one hour or even days. | The decision tree is less accurate but it processes fast, which means it is fast to implement.
7. | It needs more computation because it has n decision trees; more decision trees mean more computation. | It needs less computation.
8. | It has complex visualization, but it plays an important role in showing hidden patterns behind the data. | It is simple to visualize because we just need to fit the decision tree model.
9. | Classification and regression problems can be solved by using random forest. | A decision tree is also used to solve classification and regression problems.
10. | It uses the random subspace method and bagging during tree construction, and it has built-in feature importance. | A decision is made on a feature of the selected sample; decision tree learning is the process of finding the optimal value for each internal tree node.
Decision trees are simple but suffer from some serious problems: overfitting, error due to
variance, or error due to bias. A random forest is a collection of decision trees whose
outputs are aggregated into a single result. Using multiple trees in the random forest
reduces the chances of overfitting, but the forest is complex to understand: a decision
tree is easy to read and understand, whereas a random forest is more complicated to
interpret. A single decision tree is less accurate in predicting the results but is fast to
implement. More trees give a more robust model and help prevent overfitting. In a forest,
however, we need to generate, process, and analyze each and every tree, so the process is
slow and can sometimes take hours or even days.

Mathematics based questions

14. Explain Bagging and Boosting in training random forest tree.

Ensemble learning helps improve machine learning results by combining several models.
This approach allows better predictive performance than a single model. The basic idea is
to learn a set of classifiers (experts) and to allow them to vote. Bagging and boosting
are two types of ensemble learning. Both decrease the variance of a single estimate
because they combine several estimates from different models, so the result may be a
model with higher stability. The two terms in brief:
1. Bagging: Homogeneous weak learners are trained independently of each other in
parallel, and their outputs are combined to determine the model average.
2. Boosting: Homogeneous weak learners are also used, but they work differently from
bagging: the learners are trained sequentially and adaptively to improve the
predictions of the learning algorithm.
Bagging: Bootstrap aggregating, also known as bagging, is a machine learning ensemble
meta-algorithm designed to improve the stability and accuracy of machine learning
algorithms used in statistical classification and regression. It decreases the variance and
helps to avoid overfitting. It is usually applied to decision tree methods. Bagging is a special
case of the model averaging approach.
Description of the Technique
Given a set D of d tuples, at each iteration i a training set D_i of d tuples is sampled with
replacement from D (i.e., a bootstrap sample). A classifier model M_i is then learned from
each training set D_i. Each classifier M_i returns its class prediction. The bagged
classifier M* counts the votes and assigns the class with the most votes to X (the
unknown sample).
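
Drawing a bootstrap training set D_i of d tuples from D takes one line, and it shows a useful property: because sampling is with replacement, each D_i contains repeated tuples and leaves some tuples of D out. The sketch below uses an invented D of 1000 integers and an arbitrary seed; the expected share of distinct tuples, 1 - (1 - 1/d)^d, is a standard result that approaches about 63.2% for large d.

```python
import random

def bootstrap(D, rng):
    """Sample len(D) tuples from D with replacement."""
    return [rng.choice(D) for _ in range(len(D))]

D = list(range(1000))                 # hypothetical data set of d = 1000 tuples
rng = random.Random(42)
D_i = bootstrap(D, rng)

unique_fraction = len(set(D_i)) / len(D)
expected = 1 - (1 - 1 / len(D)) ** len(D)
print(f"distinct tuples in D_i: {unique_fraction:.3f} (expected about {expected:.3f})")
```

The tuples that a given D_i leaves out are often called out-of-bag samples and can serve as a validation set for the model M_i trained on that D_i.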
Implementation Steps of Bagging
• Step 1: Multiple subsets of equal size are created from the original data set by
selecting observations with replacement.
• Step 2: A base model is created on each of these subsets.
• Step 3: The models are trained in parallel, each on its own training set and independently of the
others.

Dr. Abhishek D. Patange, Mechanical Engineering, College of Engineering Pune (COEP)


QUESTION BANK FOR UNIT 3: CLASSIFICATION & REGRESSION

• Step 4: The final predictions are determined by combining the predictions from all the
models.

An illustration for the concept of bootstrap aggregating (Bagging)


Example of Bagging
The Random Forest model uses Bagging with decision trees, which are high-variance base
models. In addition, it selects a random subset of features when growing each tree. Several
such random trees make a Random Forest.
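The bagging steps above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not a Random Forest: the base model is a hypothetical one-feature threshold classifier (a decision stump), the toy data set is invented for the example, and the helper names (bootstrap_sample, train_stump, bagging_fit, bagging_predict) are not from any library.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Step 1: draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]

def train_stump(sample):
    """Step 2: fit a one-feature threshold classifier (a decision stump).

    Every observed x is tried as a threshold with every pair of labels,
    and the rule with the fewest errors on this sample is kept.
    """
    labels = sorted({y for _, y in sample})
    best = None
    for threshold, _ in sample:
        for low, high in [(a, b) for a in labels for b in labels]:
            errors = sum((low if x <= threshold else high) != y for x, y in sample)
            if best is None or errors < best[0]:
                best = (errors, threshold, low, high)
    _, threshold, low, high = best
    return lambda x: low if x <= threshold else high

def bagging_fit(data, n_models=25, seed=0):
    """Step 3: train each model independently on its own bootstrap sample."""
    rng = random.Random(seed)
    return [train_stump(bootstrap_sample(data, rng)) for _ in range(n_models)]

def bagging_predict(models, x):
    """Step 4: combine the predictions by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: (feature value, class label)
data = [(x, "A") for x in (1.0, 1.5, 2.0, 2.5, 3.0)] + \
       [(x, "B") for x in (7.0, 7.5, 8.0, 8.5, 9.0)]

models = bagging_fit(data)
print(bagging_predict(models, 1.0), bagging_predict(models, 9.0))
```

Because each stump sees a different bootstrap sample, the individual thresholds differ from model to model; the vote averages out that variation, which is the variance reduction described above.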
Boosting
Boosting is an ensemble modelling technique that attempts to build a strong classifier from
a number of weak classifiers. It does so by building weak models in
series. First, a model is built from the training data. Then a second model is built which
tries to correct the errors of the first model. This procedure is continued, and models
are added, until either the complete training data set is predicted correctly or the maximum
number of models has been added.
Boosting Algorithms
There are several boosting algorithms. The original ones, proposed by Robert Schapire and
Yoav Freund, were not adaptive and could not take full advantage of the weak learners.
Schapire and Freund then developed AdaBoost, an adaptive boosting algorithm that won
the prestigious Gödel Prize. AdaBoost was the first really successful boosting algorithm
developed for the purpose of binary classification. AdaBoost is short for Adaptive Boosting
and is a very popular boosting technique that combines multiple "weak classifiers" into a
single "strong classifier".
Algorithm:
1. Initialise the dataset and assign equal weight to each data point.
2. Provide this as input to the model and identify the wrongly classified data points.
3. Increase the weight of the wrongly classified data points.
4. If the required results are obtained, go to step 5; else go to step 2.
5. End
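The loop above can be sketched as a simplified AdaBoost for labels +1 and -1. The sketch makes several assumptions that are not in the text: the weak learner is a one-feature decision stump, the model weight uses the usual AdaBoost formula alpha = 0.5 * ln((1 - err) / err), the loop stops after a fixed number of rounds instead of testing for "required results", and the data are an invented toy example.

```python
import math

def best_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the best stump.

    A stump predicts `polarity` when x <= threshold and `-polarity` otherwise.
    """
    best = None
    for threshold in xs:
        for polarity in (1, -1):
            preds = [polarity if x <= threshold else -polarity for x in xs]
            err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best

def adaboost(xs, ys, rounds=3):
    n = len(xs)
    weights = [1.0 / n] * n                      # step 1: equal weights
    ensemble = []
    for _ in range(rounds):
        # step 2: fit a weak model on the weighted data
        err, threshold, polarity = best_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)    # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # weight of this model
        ensemble.append((alpha, threshold, polarity))
        # step 3: raise the weight of misclassified points, lower the rest
        updated = []
        for x, y, w in zip(xs, ys, weights):
            pred = polarity if x <= threshold else -polarity
            updated.append(w * math.exp(-alpha * y * pred))
        total = sum(updated)
        weights = [w / total for w in updated]
    return ensemble

def predict(ensemble, x):
    """Weighted vote of all stumps."""
    score = sum(alpha * (pol if x <= thr else -pol) for alpha, thr, pol in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical toy data that no single stump can classify perfectly
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

ensemble = adaboost(xs, ys, rounds=3)
print([predict(ensemble, x) for x in xs])
```

Each individual stump misclassifies at least three of the ten points, yet the weighted vote of three stumps reproduces every training label, which is the "weak learners into a strong learner" idea.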

An illustration presenting the intuition behind the boosting algorithm, consisting of the
parallel learners and weighted dataset
Similarities between Bagging and Boosting
Bagging and Boosting are both commonly used ensemble methods, and they share the following
similarities:
• Both are ensemble methods that build N learners from 1 base learner.
• Both generate several training data sets by random sampling.
• Both make the final decision by averaging the N learners (or taking the majority of them,
i.e. majority voting).
• Both are good at reducing variance and provide higher stability.
Differences between Bagging and Boosting
SN | Bagging | Boosting
1. | The simplest way of combining predictions that belong to the same type. | A way of combining predictions that belong to different types.
2. | Aims to decrease variance, not bias. | Aims to decrease bias, not variance.
3. | Each model receives equal weight. | Models are weighted according to their performance.
4. | Each model is built independently. | New models are influenced by the performance of previously built models.
5. | Different training data subsets are randomly drawn with replacement from the entire training dataset. | Every new subset contains the elements that were misclassified by previous models.
6. | Bagging tries to solve the over-fitting problem. | Boosting tries to reduce bias.
7. | If the classifier is unstable (high variance), then apply bagging. | If the classifier is stable and simple (high bias), then apply boosting.
8. | Example: The Random Forest model uses Bagging. | Example: AdaBoost uses Boosting techniques.


15. Which is the best, Bagging or Boosting?

• There is no outright winner; it depends on the data, the simulation and the
circumstances.
• Bagging and Boosting both decrease the variance of a single estimate, as they combine
several estimates from different models. So the result may be a model with higher
stability.
• If the problem is that the single model performs very poorly, Bagging will rarely
improve the bias. Boosting, however, could generate a combined model with lower
errors, as it builds on the strengths and reduces the pitfalls of the single model.
• By contrast, if the difficulty of the single model is over-fitting, then Bagging is the best
option. Boosting, for its part, does not help to avoid over-fitting; in fact, this technique
faces the same problem itself. Thus, Bagging is effective more often than Boosting.

16. What are the main advantages of using a random forest versus a single decision tree?

In an ideal world, we would like to reduce both bias-related and variance-related errors. Random
forests address this issue well. A random forest is nothing more than a collection of
decision trees whose findings are combined into a single final result. Random forests are powerful
because they reduce overfitting without massively increasing the error due to
bias. A single decision tree, by contrast, tends to overfit its training data. Random forests are
therefore a modelling tool that is far more resilient than a single decision tree: they combine
numerous decision trees to reduce overfitting and hence produce usable results.

Theory Mathematics Numerical


Topic: Naive Bayes

Theory questions

17. Why use Naive Bayes algorithm?

• The Naïve Bayes algorithm is a supervised learning algorithm that is based on Bayes'
theorem and used for solving classification problems.
• It is mainly used in text classification, which involves high-dimensional training datasets.
• The Naïve Bayes classifier is one of the simplest and most effective classification algorithms;
it helps in building fast machine learning models that can make quick
predictions.
• It is a probabilistic classifier, which means it predicts on the basis of the
probability of an object belonging to each class.
• Some popular applications of the Naïve Bayes algorithm are spam filtering, sentiment
analysis, and classifying articles.
The name Naïve Bayes is composed of two words, Naïve and Bayes, which can be
described as:
• Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is
independent of the occurrence of other features. For example, if a fruit is identified on the
basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an
apple. Each feature individually contributes to identifying it as an apple, without
depending on the others.
• Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.

18. What are the Pros and Cons of using Naive Bayes?

Advantages of Naïve Bayes Classifier:
• Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the class of a dataset.
• It can be used for binary as well as multi-class classification.
• It often performs well in multi-class prediction compared to other algorithms.
• It is a very popular choice for text classification problems.
Disadvantages of Naïve Bayes Classifier:
• Naive Bayes assumes that all features are independent or unrelated, so it cannot learn
the relationships between features.
• It requires the predictors to be independent. In most real-life cases the
predictors are dependent, and this hinders the performance of the classifier.

19. How does the Bayes algorithm differ from decision trees?

• A decision tree is a discriminative model, whereas Naive Bayes is a generative model.
• Decision trees are more flexible and easier to use. However, decision tree pruning may neglect some key
values in the training data, which can hurt accuracy.
• A major advantage of Naive Bayes classifiers is that they are not prone to overfitting,
thanks to the fact that they "ignore" irrelevant features. They are, however, prone to
poisoning, a phenomenon that occurs when features that are uncommon for the class we are
trying to predict appear in the input, causing a misclassification.
• Naive Bayes classifiers are easily implemented and highly scalable, with a linear
computational complexity with respect to the number of data entries.
• Naive Bayes is strongly associated with text-based classification. Example applications
include, but are not limited to, spam filtering and text categorization. This is because the
presence of certain words is strongly linked to their respective categories, and thus the
mutual independence assumption is more reasonable.
• Unfortunately, several data sets require that some features are hand-picked before the
classifier can work as intended.
• Decision trees are very flexible, easy to understand, and easy to debug. They work
with both classification problems and regression problems. So whether you are trying to predict a
categorical value (red, green, up, down) or a continuous
value (2.9, 3.4, etc.), decision trees will handle the problem. Probably one of the
most attractive things about decision trees is that they only need a table of data, and they will build a
classifier directly from that data without needing any up-front design work.
To some degree, properties that do not matter will not be chosen as splits and will
eventually get pruned, so the method is very tolerant of nonsense. To start with, it is "set it and forget it".
• There is a downside, however. Simple decision trees tend to overfit the training data more
than other techniques, which means you generally have to do tree pruning and tune the
pruning procedures. You did not have any up-front design cost, but you pay that back when
tuning the tree's performance. Also, simple decision trees divide the data into rectangles, so
building clusters around things means the tree has to split a lot to encompass clusters of data.
Splitting a lot leads to complex trees and raises the probability that you are overfitting. Tall trees
get pruned back, so while you can build a cluster around some feature in the data, it
might not survive the pruning process. There are other techniques, such as oblique (multivariate) splits,
which let you split along several variables at once, creating splits in the space that are
neither horizontal nor vertical (0 < slope < infinity). This is powerful, but the tree starts to
become harder to understand, and these algorithms are complex to implement. Other
techniques, like boosting and random forests of decision trees, can perform quite well, and
some feel these techniques are essential to get the best performance out of decision
trees. Again, this adds more things to understand and use to tune the tree, and hence
more things to implement. In the end, the more we add to the algorithm, the taller the
barrier to using it.
• Naive Bayes requires you to build a classification by hand. There is no way to just toss a
bunch of tabular data at it and have it pick the best features to classify with. Picking
which features matter is up to you. Decision trees will pick the best features for you
from tabular data. If there were a way for Naive Bayes to pick features, you would be getting
close to using the same techniques that make decision trees work. Given this fact,
you may need to combine Naive Bayes with other statistical techniques to
help guide you towards the features that classify best, and that could mean using decision
trees. Naive Bayes answers as a continuous classifier. There are techniques to adapt it
to categorical prediction; however, they answer in terms of probabilities like (A 90%, B
5%, C 2.5%, D 2.5%). Bayes can perform quite well, and it does not overfit nearly as much,
so there is no need to prune or process the network. That makes it a simpler
algorithm to implement. However, it is harder to debug and understand, because
it is all probabilities getting multiplied thousands of times, so you have to be careful to test
that it is doing what you expect. Naive Bayes does quite well when the training data does not
contain all possibilities, so it can be very good with small amounts of data. Decision trees
work better with lots of data compared to Naive Bayes.
• Naive Bayes is used a lot in robotics and computer vision, and does quite well with those
tasks. Decision trees perform very poorly in those situations. A decision tree taught to
recognize poker hands by looking at millions of poker hands does very poorly, because
royal flushes and quads occur so rarely that they often get pruned out. If they are pruned out of the
resulting tree, it will misclassify those important hands (recall the tall trees discussion
above). Now think about trying to diagnose cancer this way. Cancer does not
occur in the population in large amounts, so it is more likely to get pruned out. The good
news is that this can be handled by using weights: we weight a winning hand, or having
cancer, higher than a losing hand, or not having cancer, and that boosts it up the tree
so it will not get pruned out. Again, this is part of tuning the resulting tree to the
situation, as discussed earlier.
• Decision trees are neat because they tell you which inputs are the best predictors of the
outputs, so decision trees can often guide you to find whether there is a statistical relationship
between a given input and the output, and how strong that relationship is. Often the
resulting decision tree is less important than the relationships it describes. So decision trees
can be used as a research tool as you learn about your data, so that you can build other
classifiers.

Mathematics based questions

20. How does the Naive Bayes algorithm work?

The working of the Naïve Bayes classifier can be understood with the help of the example below.
Suppose we have a dataset of weather conditions and a corresponding target variable
"Play". Using this dataset, we need to decide whether we should play or not on a
particular day according to the weather conditions. To solve this problem, we
follow these steps:
1. Convert the given dataset into frequency tables.
2. Generate the likelihood table by finding the probabilities of the given features.
3. Use Bayes' theorem to calculate the posterior probability.
Problem: If the weather is sunny, should the Player play or not?
Solution: To solve this, first consider the dataset below:
SN | Outlook | Play
0 | Rainy | Yes
1 | Sunny | Yes
2 | Overcast | Yes
3 | Overcast | Yes
4 | Sunny | No
5 | Rainy | Yes
6 | Sunny | Yes
7 | Overcast | Yes
8 | Rainy | No
9 | Sunny | No
10 | Sunny | Yes
11 | Rainy | No
12 | Overcast | Yes
13 | Overcast | Yes
Frequency table for the Weather Conditions:
Weather Yes No
Overcast 5 0
Rainy 2 2
Sunny 3 2
Total 10 4

Likelihood table for the weather conditions:
Weather | No | Yes | P(Weather)
Overcast | 0 | 5 | 5/14 = 0.36
Rainy | 2 | 2 | 4/14 = 0.29
Sunny | 2 | 3 | 5/14 = 0.36
All | 4/14 = 0.29 | 10/14 = 0.71 |
Applying Bayes' theorem:
P(Yes|Sunny) = P(Sunny|Yes)*P(Yes)/P(Sunny)
P(Sunny|Yes) = 3/10 = 0.3
P(Sunny) = 5/14 = 0.36
P(Yes) = 10/14 = 0.71
So P(Yes|Sunny) = (3/10)*(10/14)/(5/14) = 3/5 = 0.60
P(No|Sunny) = P(Sunny|No)*P(No)/P(Sunny)
P(Sunny|No) = 2/4 = 0.5
P(No) = 4/14 = 0.29
P(Sunny) = 5/14 = 0.36
So P(No|Sunny) = (2/4)*(4/14)/(5/14) = 2/5 = 0.40
As we can see from the above calculation, P(Yes|Sunny) > P(No|Sunny).
Hence on a Sunny day, the Player can play the game.
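The same calculation can be checked with a short Python sketch. The rows are copied from the weather dataset above; the function name posterior is chosen for this illustration only. Exact fractions are used so that no rounding of intermediate values affects the result.

```python
from collections import Counter
from fractions import Fraction

# (Outlook, Play) pairs, in the order SN 0..13 of the weather dataset
data = [
    ("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"), ("Overcast", "Yes"),
    ("Sunny", "No"), ("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"),
    ("Rainy", "No"), ("Sunny", "No"), ("Sunny", "Yes"), ("Rainy", "No"),
    ("Overcast", "Yes"), ("Overcast", "Yes"),
]

def posterior(outlook, play):
    """P(play | outlook) = P(outlook | play) * P(play) / P(outlook)."""
    n = len(data)
    joint = Counter(data)                          # frequency table
    play_counts = Counter(p for _, p in data)
    outlook_counts = Counter(o for o, _ in data)
    likelihood = Fraction(joint[(outlook, play)], play_counts[play])
    prior = Fraction(play_counts[play], n)
    evidence = Fraction(outlook_counts[outlook], n)
    return likelihood * prior / evidence

p_yes = posterior("Sunny", "Yes")
p_no = posterior("Sunny", "No")
print(p_yes, p_no, "Play" if p_yes > p_no else "Do not play")
```

With exact arithmetic the two posteriors are 3/5 and 2/5, which sum to 1 as they should for a two-class problem.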

21. Explain Bayes’ Theorem.

Bayes' theorem, also known as Bayes' Rule or Bayes' law, is used to determine the
probability of a hypothesis given prior knowledge. It depends on conditional probability.
Using Bayes' theorem, we can find the probability of A happening, given that B has occurred.
Here, B is the evidence and A is the hypothesis. When the theorem is used in the Naive Bayes
classifier, the additional assumption is made that the predictors/features are independent,
that is, the presence of one particular feature does not affect the others. Hence the
classifier is called naive.
P(A|B) = P(B|A) * P(A) / P(B)
Where,
P(A|B) is the Posterior probability: Probability of hypothesis A given the observed event B.
P(B|A) is the Likelihood probability: Probability of the evidence B given that
hypothesis A is true.
P(A) is the Prior probability: Probability of the hypothesis before observing the evidence.
P(B) is the Marginal probability: Probability of the evidence.

Problems/Numerical

22. Problems on application of Bayes theorem for classification

Problem 1:
Consider a car theft example. The attributes are Colour, Type, and Origin, and the target, Stolen,
can be either yes or no. Use the Naive Bayes Classifier to classify a "Red Domestic SUV".
The dataset is as below.


Solution:

Note there is no example of a Red Domestic SUV in our data set. We need to calculate the
probabilities P(Red|Yes), P(SUV|Yes), P(Domestic|Yes) , P(Red|No) , P(SUV|No), and P(Domestic|No) and
multiply them by P(Yes) and P(No) respectively .


Problem 2:

Problem 3:


Theory Mathematics Numerical


Topic: Support Vector Machine

Theory questions

23. What is Support Vector Machine?

Support Vector Machine or SVM is one of the most popular supervised learning algorithms,
and it is used for classification as well as regression problems. Primarily, however, it is used
for classification problems in machine learning. The goal of the SVM algorithm is to create
the best line or decision boundary that can segregate n-dimensional space into classes, so
that we can easily put a new data point in the correct category in the future. This best
decision boundary is called a hyperplane. SVM chooses the extreme points/vectors that help
in creating the hyperplane. These extreme cases are called support vectors, and hence the
algorithm is termed Support Vector Machine. Consider the diagram below, in which there
are two different categories that are classified using a decision boundary or hyperplane:

24. How does the SVM work?

SVM works by mapping data to a high-dimensional feature space so that data points can be
categorized, even when the data are not otherwise linearly separable.
[Figure panels: original dataset; data with separator added; transformed data]

A separator between the categories is found, and then the data are transformed in such a
way that the separator can be drawn as a hyperplane. Following this, the characteristics of new
data can be used to predict the group to which a new record should belong. For example,
consider the following figure, in which the data points fall into two different categories. The
two categories can be separated with a curve, as shown in the figure. After the
transformation, the boundary between the two categories can be defined by a hyperplane,
as shown in the following figure.
The mathematical function used for the transformation is known as the kernel function.
The following are the popular kernel functions:
• Linear
• Polynomial
• Radial basis function (RBF)
• Sigmoid
A linear kernel function is recommended when linear separation of the data is
straightforward. In other cases, one of the other functions should be used. You will need to
experiment with the different functions to obtain the best model in each case, as they each
use different algorithms and parameters.
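The four kernel functions listed above can be written down directly. The sketch below uses their commonly quoted forms and plain Python lists as vectors; the parameter names degree, gamma and coef0 are assumptions chosen to match common library conventions, and the default values are illustrative only.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def linear_kernel(u, v):
    return dot(u, v)

def polynomial_kernel(u, v, degree=3, gamma=1.0, coef0=1.0):
    return (gamma * dot(u, v) + coef0) ** degree

def rbf_kernel(u, v, gamma=1.0):
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)

def sigmoid_kernel(u, v, gamma=1.0, coef0=0.0):
    return math.tanh(gamma * dot(u, v) + coef0)

u, v = [1.0, 2.0], [3.0, 4.0]
for kernel in (linear_kernel, polynomial_kernel, rbf_kernel, sigmoid_kernel):
    print(kernel.__name__, kernel(u, v))
```

Each function returns a single similarity score for a pair of points; the SVM only ever needs these pairwise scores, never the transformed coordinates themselves.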

25. Explain Hyperplanes and Support Vectors.

Hyperplane: There can be multiple lines/decision boundaries that segregate the classes in
n-dimensional space, but we need to find the best decision boundary for classifying
the data points. This best boundary is known as the hyperplane of SVM. The dimension of
the hyperplane depends on the number of features in the dataset: if there are 2
features (as shown in the image), the hyperplane is a straight line, and if there are 3
features, the hyperplane is a 2-dimensional plane. We always create the hyperplane that
has the maximum margin, which means the maximum distance between the hyperplane and the
nearest data points of each class.
Support Vectors: The data points or vectors that are closest to the hyperplane and
that affect the position of the hyperplane are termed support vectors. Since these
vectors support the hyperplane, they are called support vectors.


26. Explain Hard and soft margins with the help of sketch

The margin is the distance between the hyperplane and the closest data points (vectors) of
each class. We would like to choose a hyperplane that maximizes the
margin between classes. The graph below shows what good margins and bad margins are.
Margins can be sub-divided into:
• Soft Margin – As most real-world data are not fully linearly separable, we
allow some margin violations to occur; this is called soft margin classification. It is better
to have a large margin, even though some constraints are violated. Margin violation
means choosing a hyperplane that allows some data points to lie either on the
incorrect side of the hyperplane or inside the margin on the correct side of the
hyperplane.
• Hard Margin – If the training data are linearly separable, we can select two parallel
hyperplanes that separate the two classes of data, so that the distance between them is
as large as possible.
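Margin violations can be made concrete with the slack value max(0, 1 - y * (w.x + b)), which is the usual way a soft margin is quantified. The sketch below is an illustration under assumptions: labels are +1 and -1, the hyperplane (w, b) is a hypothetical one rather than a trained one, and the points are invented.

```python
def slack(w, b, x, y):
    """Slack of one labelled point: 0 means no margin violation."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(0.0, 1.0 - y * score)

# Hypothetical hyperplane x1 = 3, with margin boundaries at x1 = 2 and x1 = 4
w, b = (1.0, 0.0), -3.0

points = [
    ((5.0, 1.0), 1),    # outside the margin, correct side
    ((3.5, 2.0), 1),    # inside the margin, correct side
    ((2.5, 0.0), 1),    # incorrect side of the hyperplane
    ((1.0, 1.0), -1),   # outside the margin, correct side
]

slacks = [slack(w, b, x, y) for x, y in points]
print(slacks)
```

A slack of 0 means the point respects the margin, a value between 0 and 1 means it lies inside the margin but on the correct side, and a value above 1 means it is on the incorrect side. A hard margin requires every slack to be 0; a soft margin tolerates some positive values.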


27. Explain Support Vector Machine terminology.

Support Vector Machines are supervised learning models with an associated
learning algorithm. SVM is a powerful and flexible algorithm used for classification,
regression, and detection of outliers. It is used in high-dimensional spaces, where each
data item is plotted as a point in n-dimensional space such that each feature value
corresponds to the value of a specific coordinate. The classification is made on the basis of a
hyperplane/line with a margin as wide as possible, which distinguishes between two categories
clearly. Basically, support vectors are individual observations (data points), whereas the
support vector machine is the boundary that differentiates one class from another class.
Some significant terminology of SVM is given below:
• Support Vectors: These are the data points or feature vectors lying nearest to the
hyperplane. They help in defining the separating line.
• Hyperplane: It is a subspace whose dimension is one less than that of the feature space.
It is used to separate different objects into their distinct categories. The best hyperplane
is the one with the maximum separation distance between the two classes.
• Margins: A margin is defined as the (perpendicular) distance from the closest data points to the decision
boundary. There are two types of margins: good margins and bad margins. A good margin
is one where this distance is large, and a bad margin is one where it is small.
The main goal of SVM is to find the maximum marginal hyperplane, so as to segregate the
dataset into distinct classes. It proceeds in the following steps:
• First, the SVM generates candidate hyperplanes repeatedly, each of which separates the classes.
• Then it chooses the hyperplane that segregates the classes best, i.e. the one with the maximum margin.

28. Explain Linear SVM, non-linear SVM.

Linear SVM:
The working of the SVM algorithm can be understood by using an example. Suppose we
have a dataset that has two tags (green and blue), and the dataset has two features x1 and
x2. We want a classifier that can classify the pair (x1, x2) of coordinates as either green or
blue. Consider the image below. Since it is a 2-d space, we can easily separate these two
classes by just using a straight line. But there can be multiple lines that can
separate these classes. Consider the image below. Hence, the SVM algorithm helps to
find the best line or decision boundary; this best boundary or region is called a
hyperplane. The SVM algorithm finds the points of both classes that are closest to the line.
These points are called support vectors. The distance between the vectors and the
hyperplane is called the margin, and the goal of SVM is to maximize this margin. The
hyperplane with maximum margin is called the optimal hyperplane.

Non-Linear SVM:
If the data are linearly arranged, then we can separate them by using a straight line, but for non-linear
data, we cannot draw a single straight line. Consider the image below:
So to separate these data points, we need to add one more dimension. For linear data, we
have used two dimensions x and y, so for non-linear data, we will add a third dimension z. It
can be calculated as:
z = x^2 + y^2
By adding the third dimension, the sample space will become as in the image below:
So now, SVM will divide the dataset into classes in the following way. Consider the image
below:
Since we are in 3-d space, the boundary looks like a plane parallel to the x-y plane. If we convert
it to 2-d space with z = 1, then it becomes x^2 + y^2 = 1:
Hence we get a circle of radius 1 as the decision boundary in the case of non-linear data.
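The effect of the extra dimension z = x^2 + y^2 can be sketched as follows. The points and the class names are invented for the illustration; the threshold z = 1 corresponds to the circle of radius 1 described above.

```python
def lift(x, y):
    """Third coordinate added by the transformation z = x^2 + y^2."""
    return x * x + y * y

def classify(x, y, threshold=1.0):
    """In the lifted space a single threshold on z separates the classes."""
    return "inner" if lift(x, y) < threshold else "outer"

# Hypothetical points: one class near the origin, the other surrounding it
inner = [(0.0, 0.5), (0.3, -0.4), (-0.6, 0.2)]
outer = [(1.5, 0.0), (-1.2, 1.0), (0.0, -2.0)]

for x, y in inner + outer:
    print((x, y), "z =", round(lift(x, y), 2), classify(x, y))
```

No straight line in the (x, y) plane separates a class that surrounds another, but after the lift the plane z = 1 does, and that plane maps back to the circle x^2 + y^2 = 1.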


29. What are advantages and limitations of the Support Vector Machine?

Advantages
• SVMs are a good choice when we have little prior knowledge about the data.
• SVM works well even with unstructured and semi-structured data like text, images and trees.
• The kernel trick is the real strength of SVM. With an appropriate kernel function, we can
solve many complex problems.
• Unlike neural networks, SVM training does not get stuck in local optima.
• It scales relatively well to high-dimensional data. SVM is effective in high-dimensional
spaces.
• SVM models generalize well in practice; the risk of over-fitting is lower in SVM.
• SVM works relatively well when there is a clear margin of separation between classes.
• SVM is effective in cases where the number of dimensions is greater than the number of
samples.
• SVM is relatively memory efficient.
Disadvantages
• Choosing a "good" kernel function is not easy.
• The SVM algorithm is not suitable for large data sets.
• Training time is long for large datasets.
• It is difficult to understand and interpret the final model, the variable weights and their individual
impact.
• Since the final model is not easy to inspect, we cannot make small calibrations to the model,
hence it is tough to incorporate our business logic.
• The SVM hyper parameters are the cost C and gamma. It is not easy to fine-tune these
hyper-parameters, and it is hard to visualize their impact.
• SVM does not perform very well when the data set is noisy, i.e. when the target classes are
overlapping.
• In cases where the number of features for each data point greatly exceeds the number of
training data samples, the SVM can underperform unless the kernel function and the
regularization are chosen carefully to avoid over-fitting.
• As the support vector classifier works by putting data points above and below the
classifying hyperplane, there is no probabilistic explanation for the classification.

30. Explain multi class classification methods

• Binary classification models like logistic regression and SVM do not support multi-class
classification natively and require meta-strategies. Two examples of this approach are the
One-vs-Rest and One-vs-One strategies.
• The One-vs-Rest strategy splits a multi-class classification into one binary classification
problem per class.
• The One-vs-One strategy splits a multi-class classification into one binary classification
problem per pair of classes.
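The difference between the two strategies is easiest to see by listing the binary problems each one creates. The sketch below only enumerates the problems; it does not train any classifier. The class names and the helper names are hypothetical.

```python
from itertools import combinations

def one_vs_rest_tasks(classes):
    """One binary problem per class: that class against all the others."""
    return [(c, [o for o in classes if o != c]) for c in classes]

def one_vs_one_tasks(classes):
    """One binary problem per unordered pair of classes."""
    return list(combinations(classes, 2))

classes = ["A", "B", "C", "D"]   # hypothetical class labels

ovr = one_vs_rest_tasks(classes)
ovo = one_vs_one_tasks(classes)
print(len(ovr), "one-vs-rest problems:", ovr)
print(len(ovo), "one-vs-one problems:", ovo)
```

For K classes, One-vs-Rest needs K binary classifiers, while One-vs-One needs K(K-1)/2, each trained on the data of only two classes.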

31. Explain hyper parameters of SVM.

Hyper parameters of SVM are considered as Kernel, Regularization, Gamma and Margin.
Kernel: The learning of the hyperplane in linear SVM is done by transforming the problem
using some linear algebra. This is where the kernel plays a role.
For the linear kernel, the prediction for a new input is calculated from the dot product between
the input (x) and each support vector (xi) as follows:
f(x) = B0 + sum(ai * (x,xi))
This equation involves calculating the inner products of a new input vector (x) with
all support vectors in the training data. The coefficients B0 and ai (one for each support
vector) must be estimated from the training data by the learning algorithm.
The polynomial kernel can be written as K(x,xi) = (1 + sum(x * xi))^d and the exponential kernel as
K(x,xi) = exp(-gamma * sum((x - xi)^2)).
Polynomial and exponential kernels calculate the separation line in a higher dimension. This is
called the kernel trick.
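The linear-kernel prediction f(x) = B0 + sum(ai * (x,xi)) can be evaluated with a few lines of Python. The numbers below are a hypothetical two-point example, not the output of a trained model; it is assumed here that each coefficient ai already carries the sign of its support vector's class.

```python
def dot(u, v):
    """Inner product (x, xi) of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def decision_function(x, support_vectors, coefficients, intercept):
    """f(x) = B0 + sum(ai * (x, xi)) over all support vectors xi."""
    return intercept + sum(a * dot(x, sv)
                           for a, sv in zip(coefficients, support_vectors))

# Hypothetical model: one support vector per class
support_vectors = [(1.0, 1.0), (3.0, 3.0)]
coefficients = [-0.25, 0.25]   # negative for class -1, positive for class +1
intercept = -2.0               # B0

for x in [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 5.0)]:
    f = decision_function(x, support_vectors, coefficients, intercept)
    print(x, f, "class +1" if f >= 0 else "class -1")
```

The sign of f(x) gives the predicted class. In this example the two support vectors evaluate to -1 and +1, i.e. they sit exactly on the two margin boundaries, and the midpoint (2, 2) lies on the hyperplane itself.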
Regularization: The Regularization parameter (often termed the C parameter in Python's
sklearn library) tells the SVM optimization how much you want to avoid misclassifying each
training example. For large values of C, the optimization will choose a smaller-margin
hyperplane if that hyperplane does a better job of getting all the training points classified
correctly. Conversely, a very small value of C will cause the optimizer to look for a larger-
margin separating hyperplane, even if that hyperplane misclassifies more points. The images
below are examples of two different regularization parameters. The left one has some
misclassification due to the lower value of C. A higher value leads to results like the right
one.

Left: low regularization value, right: high regularization value


Gamma: The gamma parameter defines how far the influence of a single training example
reaches, with low values meaning 'far' and high values meaning 'close'. In other words, with
low gamma, points far away from the plausible separation line are considered in the calculation of
the separation line, whereas with high gamma only the points close to the plausible line are
considered in the calculation.

[Figures: high gamma; low gamma]
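The "reach" of a training example can be illustrated with the exponential (RBF) kernel value exp(-gamma * d^2) at distance d from the example. This is a sketch under the assumption that the kernel has this form; the gamma values and distances are arbitrary illustrative choices.

```python
import math

def rbf_influence(distance, gamma):
    """Kernel value of a training example at the given distance from it."""
    return math.exp(-gamma * distance ** 2)

for gamma in (0.1, 1.0, 10.0):
    values = [round(rbf_influence(d, gamma), 4) for d in (0.5, 1.0, 2.0)]
    print("gamma =", gamma, "influence at d = 0.5, 1, 2:", values)
```

With low gamma the value decays slowly, so distant points still contribute; with high gamma it drops to almost zero a short distance away, so only nearby points matter.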


Margin: The last, but very important, characteristic of the SVM classifier is the margin. At its core, SVM tries
to achieve a good margin.
A margin is the separation between the line and the closest class points.
A good margin is one where this separation is large for both classes. The images below
give a visual example of a good and a bad margin. A good margin allows the points to be in
their respective classes without crossing over to the other class.

Hyper parameters in detail

Sr.No  Parameter & Description

1   C − float, optional, default = 1.0
    It is the penalty parameter of the error term.
2   kernel − string, optional, default = 'rbf'
    This parameter specifies the type of kernel to be used in the algorithm. We can
    choose any one among 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed'.
3   degree − int, optional, default = 3
    It represents the degree of the 'poly' kernel function and is ignored by all other
    kernels.
4   gamma − {'scale', 'auto'} or float, optional, default = 'scale'
    It is the kernel coefficient for the kernels 'rbf', 'poly' and 'sigmoid'.
    If you choose the default, i.e. gamma = 'scale', then the value of gamma used by
    SVC is 1/(n_features * X.var()).
    On the other hand, if gamma = 'auto', it uses 1/n_features.
5   coef0 − float, optional, default = 0.0
    An independent term in the kernel function, which is only significant in 'poly' and
    'sigmoid'.
6   tol − float, optional, default = 1e-3
    This parameter represents the stopping criterion for iterations.
7   shrinking − boolean, optional, default = True
    This parameter states whether we want to use the shrinking heuristic or not.
8   verbose − boolean, default = False
    It enables or disables verbose output.
9   probability − boolean, optional, default = False
    This parameter enables or disables probability estimates. The default value is False;
    it must be enabled before we call fit.
10  max_iter − int, optional, default = -1
    As the name suggests, it represents the maximum number of iterations within the solver.
    The value -1 means there is no limit on the number of iterations.
11  cache_size − float, optional
    This parameter specifies the size of the kernel cache. The value is in MB (megabytes).
12  random_state − int, RandomState instance or None, optional, default = None
    This parameter represents the seed of the pseudo-random number generator which
    is used while shuffling the data. The options are as follows −
    • int − In this case, random_state is the seed used by the random number generator.
    • RandomState instance − In this case, random_state is the random number generator.
    • None − In this case, the random number generator is the RandomState instance used
      by np.random.
13  class_weight − {dict, 'balanced'}, optional
    This parameter sets the parameter C of class j to class_weight[j]*C for SVC. If we
    use the default option, all the classes are supposed to have weight one. On the other
    hand, if you choose class_weight = 'balanced', it uses the values of y to automatically
    adjust the weights.
14  decision_function_shape − {'ovo', 'ovr'}, default = 'ovr'
    This parameter decides whether the algorithm returns a one-vs-rest ('ovr') decision
    function, with the same shape as all other classifiers, or the original one-vs-one
    ('ovo') decision function of libsvm.
15  break_ties − boolean, optional, default = False
    True − predict will break ties according to the confidence values of
    decision_function.
    False − predict will return the first class among the tied classes.
Thus, hyperparameter tuning means choosing a set of optimal hyperparameters for a learning
algorithm. A hyperparameter is a model argument whose value is set before the learning
process begins. Careful hyperparameter tuning is often the key to getting good performance
from a machine learning algorithm.
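As a small worked example of the two gamma settings described in the table, the sketch below computes 1/(n_features * X.var()) and 1/n_features by hand. It assumes X.var() means the variance over all entries of X taken together, and the toy data is hypothetical.

```python
def gamma_values(X):
    """Return (gamma_scale, gamma_auto) for a dataset given as a list of rows."""
    n_features = len(X[0])
    flat = [v for row in X for v in row]   # assumption: variance pools all entries
    mean = sum(flat) / len(flat)
    variance = sum((v - mean) ** 2 for v in flat) / len(flat)
    gamma_scale = 1.0 / (n_features * variance)
    gamma_auto = 1.0 / n_features
    return gamma_scale, gamma_auto

X = [[0.0, 2.0], [2.0, 4.0], [4.0, 6.0]]   # hypothetical toy data, 2 features
gamma_scale, gamma_auto = gamma_values(X)
print(gamma_scale, gamma_auto)
```

Because 'scale' divides by the variance of the data, rescaling the features changes the effective gamma, whereas 'auto' depends only on the number of features.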

Mathematics based questions

32. Explain following Kernel functions: linear, polynomial, rbf, sigmoid
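The four kernels named in this question are commonly written as: linear K(x, z) = x⋅z; polynomial K(x, z) = (gamma·x⋅z + coef0)^degree; rbf K(x, z) = exp(−gamma·∥x − z∥²); sigmoid K(x, z) = tanh(gamma·x⋅z + coef0). The sketch below writes them as plain Python functions, reusing the hyper parameter names gamma, coef0 and degree from the table above. The function names and the sample vectors are illustrative assumptions, not part of any library.

```python
import math

def dot(x, z):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, z))

def linear_kernel(x, z):
    return dot(x, z)

def poly_kernel(x, z, gamma=1.0, coef0=0.0, degree=3):
    return (gamma * dot(x, z) + coef0) ** degree

def rbf_kernel(x, z, gamma=1.0):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def sigmoid_kernel(x, z, gamma=1.0, coef0=0.0):
    return math.tanh(gamma * dot(x, z) + coef0)

x, z = (1.0, 2.0), (3.0, 0.5)   # hypothetical sample vectors, x . z = 4
print(linear_kernel(x, z))
print(poly_kernel(x, z, gamma=1.0, coef0=1.0, degree=2))
print(rbf_kernel(x, z, gamma=0.5))
print(sigmoid_kernel(x, z, gamma=0.1))
```

The linear kernel keeps the data in its original space, while the other three correspond to dot products in an implicit higher-dimensional feature space.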


Problems/Numerical

33. Problems based on calculating hyperplane and margin

In Support Vector Machine, there is the word vector. That means it is important to
understand vectors well and how to use them.
• What is a vector?
  ◦ its norm
  ◦ its direction
• How to add and subtract vectors?
• What is the dot product?
• How to project a vector onto another?
• What is the equation of the hyperplane?
• How to compute the margin?
What is a vector?
If we define a point A(3,4) in ℝ², we can plot it like this.

Definition: Any point x = (x1, x2), x ≠ 0, in ℝ² specifies a vector in the plane, namely the
vector starting at the origin and ending at x.
This definition means that there exists a vector between the origin and A.


If we say that the point at the origin is the point O(0,0), then the vector above is the
vector $\overrightarrow{OA}$. We could also give it an arbitrary name such as u.
Note:
Vectors are written either with an arrow on top or in bold. In the rest of this text the
arrow is used when there are two letters, like $\overrightarrow{OA}$, and the bold notation
otherwise.
So now we know that there is a vector, but we still don't know what a vector is.
Definition: A vector is an object that has both a magnitude and a direction.
We will now look at these two concepts.
1) The magnitude
The magnitude or length of a vector x is written ∥x∥ and is called its norm.
For our vector $\overrightarrow{OA}$, ∥OA∥ is the length of the segment OA.

From the figure, we can easily calculate the distance OA using Pythagoras' theorem:
OA² = OB² + AB²
OA² = 3² + 4²
OA² = 25
OA = 5
∥OA∥ = 5
2) The direction
The direction is the second component of a vector.
Definition: The direction of a vector u(u1,u2) is the vector w(u1/∥u∥, u2/∥u∥)

Where do the coordinates of w come from?


Understanding the definition
To find the direction of a vector, we need to use its angles.


Figure 4 displays the vector u(u1,u2) with u1=3 and u2=4


We could say that:
Naive definition 1: The direction of the vector u is defined by the angle θ with respect to
the horizontal axis and by the angle α with respect to the vertical axis. This is tedious.
Instead, we will use the cosines of the angles.
In a right triangle, the cosine of an angle β is defined by:
cos(β) = adjacent/hypotenuse
In Figure 4 we can form two right triangles, and in both cases the adjacent side lies on one
of the axes. This means that the definition of the cosine implicitly contains the axis
related to an angle. We can rephrase our naive definition as:
Naive definition 2: The direction of the vector u is defined by the cosine of the angle θ and
the cosine of the angle α.
Now if we look at their values:
cos(θ) = u1/∥u∥

cos(α) = u2/∥u∥

Hence the original definition of the vector w. That is why its coordinates are also called
direction cosines.
Computing the direction vector
We will now compute the direction of the vector u from Figure 4.
cos(θ) = u1/∥u∥ = 3/5 = 0.6

cos(α) = u2/∥u∥ = 4/5 = 0.8

The direction of u(3,4) is the vector w(0.6,0.8)


If we draw this vector we get Figure 5:


We can see that w indeed looks the same as u except that it is smaller. Something
interesting about direction vectors like w is that their norm is equal to 1. That is why we
often call them unit vectors.
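The norm and direction calculations above can be reproduced in a few lines. This is a minimal sketch using only the standard library; the helper names norm and direction are illustrative choices.

```python
import math

def norm(u):
    """Euclidean norm (length) of a vector."""
    return math.sqrt(sum(c * c for c in u))

def direction(u):
    """Direction vector of u: each coordinate divided by the norm of u."""
    n = norm(u)
    return tuple(c / n for c in u)

u = (3.0, 4.0)
w = direction(u)
print(norm(u))   # 5.0
print(w)         # (0.6, 0.8)
print(norm(w))   # very close to 1, up to floating-point rounding
```

The coordinates of w are exactly the direction cosines cos(θ) and cos(α) computed by hand above.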
The sum of two vectors

Given two vectors u(u1,u2) and v(v1,v2), then:
u+v=(u1+v1,u2+v2)
This means that adding two vectors gives us a third vector whose coordinates are the sums
of the coordinates of the original vectors. You can convince yourself with the example below:


The difference between two vectors


The difference works the same way: u−v=(u1−v1,u2−v2)

Since the subtraction is not commutative, we can also consider the other case:
v−u=(v1−u1,v2−u2)

The last two pictures describe the "true" vectors generated by the difference of u and v.
However, since a vector has a magnitude and a direction, we often consider that parallel
translates of a given vector (vectors with the same magnitude and direction but with a
different origin) are the same vector, just drawn in a different place in space.
So don't be surprised if you meet the following:


If you do the math, it looks wrong, because the end of the vector u−v is not at the right
point, but it is a convenient way of thinking about vectors which you'll encounter often.
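Coordinate-wise addition and subtraction can be checked with a short sketch. The two vectors are hypothetical values chosen for illustration.

```python
def add(u, v):
    """Coordinate-wise sum of two vectors."""
    return tuple(a + b for a, b in zip(u, v))

def subtract(u, v):
    """Coordinate-wise difference u - v."""
    return tuple(a - b for a, b in zip(u, v))

u, v = (3, 1), (1, 2)    # hypothetical vectors
print(add(u, v))         # (4, 3)
print(subtract(u, v))    # (2, -1)
print(subtract(v, u))    # (-2, 1)
```

The last two lines show that u−v and v−u have the same length but point in opposite directions, which is why the order of subtraction matters.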
The dot product
One very important notion for understanding SVM is the dot product.
Definition: Geometrically, it is the product of the Euclidean magnitudes of the two vectors
and the cosine of the angle between them.
This means that if we have two vectors x and y and there is an angle θ (theta) between them,
their dot product is:
x⋅y=∥x∥∥y∥cos(θ)
Why?
To understand, let's look at the problem geometrically.

The definition involves cos(θ), so let's see what it is.
By definition we know that in a right-angled triangle:


cos(θ)=adjacent/hypotenuse
In our example, we don't have a right-angled triangle.
However, if we take a different look at Figure 12, we can find two right-angled triangles
formed by each vector with the horizontal axis.

So now we can view our original schema like this:


We can see that θ=β−α


So computing cos(θ) is like computing cos(β−α)
There is a special formula called the difference identity for cosine which says that:
cos(β−α)=cos(β)cos(α)+sin(β)sin(α)
Let's use this formula!
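Carrying the identity through gives the familiar coordinate form of the dot product. The derivation below is a sketch that assumes β is the angle between x = (x1, x2) and the horizontal axis and α is the angle between y = (y1, y2) and the same axis; swapping the two labels leads to the same result.

```latex
\begin{align*}
\cos\beta &= \frac{x_1}{\lVert x\rVert}, \quad
\sin\beta = \frac{x_2}{\lVert x\rVert}, \quad
\cos\alpha = \frac{y_1}{\lVert y\rVert}, \quad
\sin\alpha = \frac{y_2}{\lVert y\rVert} \\
\cos\theta &= \cos(\beta-\alpha)
  = \cos\beta\cos\alpha + \sin\beta\sin\alpha
  = \frac{x_1 y_1 + x_2 y_2}{\lVert x\rVert\,\lVert y\rVert} \\
\lVert x\rVert\,\lVert y\rVert\cos\theta &= x_1 y_1 + x_2 y_2 \\
x \cdot y &= x_1 y_1 + x_2 y_2 = \sum_{i=1}^{2} x_i y_i
\end{align*}
```

So the geometric definition ∥x∥∥y∥cos(θ) and the algebraic sum of coordinate products describe the same quantity.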


A few words on notation


The dot product is so called because we write a dot between the two vectors. Talking
about the dot product x⋅y is the same as talking about:
• the inner product ⟨x,y⟩ (in linear algebra),
• the scalar product, because we take the product of two vectors and it returns a scalar
  (a real number).
The orthogonal projection of a vector
Given two vectors x and y, we would like to find the orthogonal projection of x onto y.

To do this we project the vector x onto y

This gives us the vector z
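One common way to compute the projection is z = (u⋅x)u, where u = y/∥y∥ is the direction of y. The sketch below assumes that formula and uses hypothetical vectors chosen so that the arithmetic is easy to follow.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))

def project(x, y):
    """Orthogonal projection of x onto y: z = (u . x) u with u = y / ||y||."""
    norm_y = math.sqrt(dot(y, y))
    u = tuple(c / norm_y for c in y)
    scale = dot(u, x)                      # signed length of the projection
    return tuple(scale * c for c in u)

x, y = (2.0, 3.0), (4.0, 0.0)              # hypothetical vectors
z = project(x, y)
residual = tuple(a - b for a, b in zip(x, z))
print(z)                  # (2.0, 0.0)
print(dot(residual, y))   # 0.0, so x - z is orthogonal to y
```

The residual x − z being orthogonal to y is what makes the projection "orthogonal".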


The SVM hyperplane: Understanding the equation of the hyperplane


You probably learnt that the equation of a line is y=ax+b. However, when reading about
hyperplanes, you will often find that the equation of a hyperplane is defined by:
wᵀx = 0
How do these two forms relate?
In the hyperplane equation the names of the variables are in bold, which means that they
are vectors. Moreover, wᵀx is how we compute the inner product of two vectors, and if you
recall, the inner product is just another name for the dot product.
Note that y=ax+b is the same thing as y−ax−b=0. Given the two vectors w = (−b, −a, 1) and
x = (1, x, y), we have wᵀx = −b×1 + (−a)×x + 1×y = y − ax − b, so the two equations are just
different ways of expressing the same thing.
The hyperplane form wᵀx = 0 is used instead of y=ax+b for two reasons:
• it is easier to work in more than two dimensions with this notation,
• the vector w will always be normal to the hyperplane. (w is normal because we use this
  vector to define the hyperplane: when we define a hyperplane, we suppose that we have a
  vector that is orthogonal to it.)
This last property will come in handy to compute the distance from a point to the
hyperplane.
Compute the distance from a point to the hyperplane
In Figure 20 we have a hyperplane which separates two groups of data.

To simplify this example, we have set w0 = 0.

As you can see in Figure 20, the equation of the hyperplane is:
x2 = −2x1, which is equivalent to
wᵀx = 0 with w = (2, 1) and x = (x1, x2).


We can view the point A as a vector from the origin to A. If we project it onto the normal
vector w,

we get the vector p.
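The length of p is the distance from A to the hyperplane. The sketch below works this out for the hyperplane x2 = −2x1, whose normal vector can be taken as w = (2, 1). The coordinates A = (3, 4) are an assumption, reusing the point from the vector section, and taking the margin as twice the distance assumes the hyperplane lies midway between the closest points of the two classes.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))

def distance_to_hyperplane(point, w):
    """Distance from `point` to the hyperplane w . x = 0 (bias w0 = 0)."""
    return abs(dot(w, point)) / math.sqrt(dot(w, w))

w = (2.0, 1.0)   # normal of x2 = -2*x1, i.e. 2*x1 + x2 = 0
A = (3.0, 4.0)   # assumed coordinates of point A

norm_w = math.sqrt(dot(w, w))
u = tuple(c / norm_w for c in w)        # direction of w
p = tuple(dot(u, A) * c for c in u)     # projection of A onto w
dist = distance_to_hyperplane(A, w)     # equals the norm of p
margin = 2 * dist                       # assumption: symmetric margin

print(p, dist, margin)
```

Under these assumptions p is approximately (4, 2), its norm is 2√5 ≈ 4.47, and the margin is 4√5 ≈ 8.94.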


Theory Mathematics Numerical


Topic: Logistic Regression

Theory questions

34. Why use Logistic Regression?

• Logistic regression is one of the most popular Machine Learning algorithms and comes
  under the Supervised Learning technique. It is used for predicting a categorical
  dependent variable from a given set of independent variables.
• Logistic regression predicts the output of a categorical dependent variable. Therefore the
  outcome must be a categorical or discrete value. It can be Yes or No, 0 or 1, True or
  False, etc., but instead of giving the exact value 0 or 1, it gives probabilistic values
  which lie between 0 and 1.
• Logistic Regression is very similar to Linear Regression except in how they are used.
  Linear Regression is used for solving regression problems, whereas Logistic Regression is
  used for solving classification problems.
• In Logistic regression, instead of fitting a regression line, we fit an "S"-shaped logistic
  function whose output is bounded by two limiting values (0 and 1). The curve from the
  logistic function indicates the likelihood of something, such as whether cells are
  cancerous or not, or whether a mouse is obese or not based on its weight.
• Logistic Regression is a significant machine learning algorithm because it can provide
  probabilities and classify new data using continuous and discrete datasets. It can be used
  to classify observations using different types of data and can easily determine the most
  effective variables for the classification. The image below shows the logistic function:


Note: Logistic regression uses the concept of predictive modeling in the same way as
regression, which is why it is called logistic regression; but it is used to classify
samples, so it falls under the classification algorithms.

35. Explain principle of Logistic Regression.

Logistic Function (Sigmoid Function):

• The sigmoid function is a mathematical function used to map the predicted values to
  probabilities.
• It maps any real value into another value within the range 0 to 1.
• The output of logistic regression must lie between 0 and 1 and cannot go beyond these
  limits, so it forms an "S"-shaped curve. This S-shaped curve is called the sigmoid
  function or the logistic function.
• In logistic regression, we use the concept of a threshold value, which decides between the
  outcomes 0 and 1: values above the threshold are mapped to 1, and values below the
  threshold are mapped to 0.
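The sigmoid mapping and the thresholding step can be sketched as follows. The threshold of 0.5 and the sample inputs are assumptions made for illustration, and the simple formula used here is intended for moderate values of z.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z)): maps a real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(z, threshold=0.5):
    """Turn the probability sigmoid(z) into a 0/1 label using a threshold."""
    return 1 if sigmoid(z) >= threshold else 0

for z in (-4.0, 0.0, 4.0):   # hypothetical linear-model outputs
    print(z, round(sigmoid(z), 3), predict_class(z))
```

Large negative inputs give probabilities near 0, large positive inputs give probabilities near 1, and z = 0 sits exactly at 0.5.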
Assumptions for Logistic Regression:
• The dependent variable must be categorical in nature.
• The independent variables should not exhibit multicollinearity.
Logistic Regression Equation:
The Logistic Regression equation can be obtained from the Linear Regression equation. The
mathematical steps to get the Logistic Regression equation are given below:
We know the equation of the straight line can be written as:
y = b0 + b1x1 + b2x2 + ... + bnxn
In Logistic Regression y can be between 0 and 1 only, so let's divide the above equation by
(1−y):
y/(1−y), which is 0 for y = 0 and infinity for y = 1
But we need a range between −infinity and +infinity, so we take the logarithm of the
equation, and it becomes:
log[y/(1−y)] = b0 + b1x1 + b2x2 + ... + bnxn
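Taking the logarithm of y/(1−y) gives the log-odds, and the sigmoid function undoes that step. The sketch below checks this round trip for a one-feature linear model; the coefficients b0 and b1 and the input x1 are hypothetical values.

```python
import math

def sigmoid(z):
    """Logistic function: maps a real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def log_odds(p):
    """Logarithm of the odds p / (1 - p), defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))

b0, b1 = -3.0, 1.5     # hypothetical coefficients
x1 = 3.0               # hypothetical input
z = b0 + b1 * x1       # linear part, here 1.5
p = sigmoid(z)         # probability between 0 and 1
print(z, p, log_odds(p))
```

The printed log-odds matches the linear part z up to rounding, which is the sense in which logistic regression is linear in the log-odds.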

36. State types of Logistic Regression?

On the basis of the categories, Logistic Regression can be classified into three types:
• Binomial: In binomial Logistic regression, the dependent variable can have only two
  possible types, such as 0 or 1, Pass or Fail, etc.
• Multinomial: In multinomial Logistic regression, there can be 3 or more possible
  unordered types of the dependent variable, such as "cat", "dog", or "sheep".
• Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of
  the dependent variable, such as "Low", "Medium", or "High".

37. What are advantages and limitations of the Logistic Regression?

Advantages:
• Logistic regression is easy to implement and interpret, and very efficient to train.
• It makes no assumptions about the distributions of classes in feature space.
• It can easily be extended to multiple classes (multinomial regression) and gives a natural
  probabilistic view of class predictions.
• It not only provides a measure of how appropriate a predictor is (coefficient size), but
  also its direction of association (positive or negative).
• It is very fast at classifying unknown records.
• It gives good accuracy for many simple data sets and performs well when the dataset is
  linearly separable.
• Its model coefficients can be interpreted as indicators of feature importance.
• Logistic regression is less inclined to over-fitting, but it can overfit in high
  dimensional datasets. One may consider regularization (L1 and L2) techniques to avoid
  over-fitting in these scenarios.

Disadvantages:
• If the number of observations is smaller than the number of features, Logistic Regression
  should not be used; otherwise, it may lead to overfitting.
• It constructs linear boundaries.
• A limitation of Logistic Regression is the assumption of linearity between the dependent
  variable and the independent variables.
• It can only be used to predict discrete functions. Hence, the dependent variable of
  Logistic Regression is bound to the discrete number set.
• Non-linear problems can't be solved with logistic regression because it has a linear
  decision surface. Linearly separable data is rarely found in real-world scenarios.
• Logistic Regression requires little or no multicollinearity between independent variables.
• It is tough to capture complex relationships using logistic regression. More powerful
  algorithms such as Neural Networks can easily outperform this algorithm.
• In Linear Regression the independent and dependent variables are related linearly, but
  Logistic Regression needs the independent variables to be linearly related to the log
  odds, log(p/(1−p)).


38. Differentiate between logistic regression and linear regression?

Linear Regression vs. Logistic Regression

1.  Linear: It is used to predict a continuous dependent variable using a given set of
    independent variables.
    Logistic: It is used to predict a categorical dependent variable using a given set of
    independent variables.
2.  Linear: It is used for solving regression problems.
    Logistic: It is used for solving classification problems.
3.  Linear: We predict the value of continuous variables.
    Logistic: We predict the values of categorical variables.
4.  Linear: We find the best-fit line, by which we can easily predict the output.
    Logistic: We find the S-curve, by which we can classify the samples.
5.  Linear: The least squares estimation method is used to estimate the model coefficients.
    Logistic: The maximum likelihood estimation method is used to estimate the model
    coefficients.
6.  Linear: The output must be a continuous value, such as price, age, etc.
    Logistic: The output must be a categorical value, such as 0 or 1, Yes or No, etc.
7.  Linear: The relationship between the dependent variable and the independent variables
    must be linear.
    Logistic: A linear relationship between the dependent and independent variables is not
    required.
8.  Linear: There may be collinearity between the independent variables.
    Logistic: There should not be collinearity between the independent variables.
9.  Linear: It is a supervised regression model.
    Logistic: It is a supervised classification model.
10. Linear: We predict the value as a real (continuous) number.
    Logistic: We predict the value as 1 or 0.
11. Linear: No activation function is used.
    Logistic: An activation function is used to convert the linear regression equation into
    the logistic regression equation.
12. Linear: No threshold value is needed.
    Logistic: A threshold value is added.
13. Linear: We calculate the Root Mean Square Error (RMSE) to predict the next weight value.
    Logistic: We use precision to predict the next weight value.
14. Linear: The dependent variable should be numeric and the response variable is
    continuous.
    Logistic: The dependent variable consists of only two categories. Logistic regression
    estimates the odds of the outcome of the dependent variable given a set of quantitative
    or categorical independent variables.
15. Linear: It is based on least squares estimation.
    Logistic: It is based on maximum likelihood estimation.
16. Linear: When we plot the training data, a straight line can be drawn that touches the
    maximum number of points.
    Logistic: Any change in the coefficient leads to a change in both the direction and the
    steepness of the logistic function. Positive slopes result in an S-shaped curve and
    negative slopes result in a Z-shaped curve.
17. Linear: It is used to estimate the dependent variable in case of a change in the
    independent variables. For example, predict the price of houses.
    Logistic: It is used to calculate the probability of an event. For example, classify
    whether tissue is benign or malignant.
18. Linear: It assumes a normal (Gaussian) distribution of the dependent variable.
    Logistic: It assumes a binomial distribution of the dependent variable.

The Similarities between Linear Regression and Logistic Regression

• Linear Regression and Logistic Regression are both supervised Machine Learning
  algorithms.
• Linear Regression and Logistic Regression are both parametric models, i.e. both use
  linear equations for predictions.
Those are all the similarities between these two models. Functionality-wise, however, the
two are completely different. The differences are as follows.
The Differences between Linear Regression and Logistic Regression
• Linear Regression is used to handle regression problems, whereas Logistic Regression
  is used to handle classification problems.
• Linear Regression provides a continuous output, but Logistic Regression provides a
  discrete output.
• The purpose of Linear Regression is to find the best-fitted line, while Logistic
  Regression goes one step further and fits the line values to the sigmoid curve.
• The loss function in Linear Regression is the mean squared error, whereas Logistic
  Regression is fitted by maximum likelihood estimation.


Mathematics based questions

39. Cost function

The cost function of linear regression (for m training examples) is commonly written as:
J(θ) = (1/2m) Σ (hθ(x(i)) − y(i))², with the sum taken over i = 1, ..., m

The cost function of logistic regression is:
Cost(hθ(x), y) = −log(hθ(x))       if y = 1
Cost(hθ(x), y) = −log(1 − hθ(x))   if y = 0

The i indexes have been removed for clarity. In words, this is the cost the algorithm pays if
it predicts a value hθ(x) while the actual label turns out to be y. Using this function keeps
the cost that the gradient descent algorithm has to minimize convex. There is also a
mathematical proof for that, which is outside the scope of this introductory course. In case
y=1, the output (i.e. the cost to pay) approaches 0 as hθ(x) approaches 1. Conversely, the
cost to pay grows to infinity as hθ(x) approaches 0. This is a desirable property: we want a
bigger penalty as the algorithm predicts something far away from the actual value. If the
label is y=1 but the algorithm predicts hθ(x)=0, the outcome is completely wrong. The same
intuition applies when y=0, depicted in plot 2 (right side): the penalty grows when the label
is y=0 but the algorithm predicts hθ(x)=1. The two cases can be compressed into a single
function, i.e.
Cost(hθ(x), y) = −y log(hθ(x)) − (1 − y) log(1 − hθ(x))
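The single-expression cost −y·log(h) − (1 − y)·log(1 − h), where h stands for hθ(x), can be checked numerically. This is a minimal sketch with made-up predictions; it assumes 0 < h < 1 so that both logarithms are defined.

```python
import math

def logistic_cost(h, y):
    """Cost for one example: -y*log(h) - (1 - y)*log(1 - h), with 0 < h < 1."""
    return -y * math.log(h) - (1 - y) * math.log(1 - h)

for h in (0.99, 0.5, 0.01):   # hypothetical predicted probabilities
    print(h, round(logistic_cost(h, 1), 4), round(logistic_cost(h, 0), 4))
```

For y = 1 the cost shrinks toward 0 as h approaches 1 and grows as h approaches 0; for y = 0 the pattern is mirrored.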

Theory Mathematics Numerical


Topic: K-Means & K-Nearest Neighbor (KNN)

Theory questions

40. What is meant by K Nearest Neighbor algorithm?

• K-Nearest Neighbour is one of the simplest Machine Learning algorithms, based on the
  Supervised Learning technique.
• The K-NN algorithm assumes similarity between the new case/data and the available cases
  and puts the new case into the category that is most similar to the available categories.
• The K-NN algorithm stores all the available data and classifies a new data point based on
  similarity. This means that when new data appears, it can easily be classified into a
  well-suited category by using the K-NN algorithm.
• The K-NN algorithm can be used for Regression as well as for Classification, but mostly it
  is used for Classification problems.
• K-NN is a non-parametric algorithm, which means it does not make any assumption about the
  underlying data.
• It is also called a lazy learner algorithm because it does not learn from the training set
  immediately; instead it stores the dataset and performs an action on it at the time of
  classification.
• At the training phase the KNN algorithm just stores the dataset, and when it gets new
  data, it classifies that data into the category that is most similar to the new data.
• Example: Suppose we have an image of a creature that looks similar to a cat and a dog, and
  we want to know whether it is a cat or a dog. For this identification we can use the KNN
  algorithm, as it works on a similarity measure. Our KNN model will compare the features of
  the new image with those of the cat and dog images and, based on the most similar
  features, put it in either the cat or the dog category.


41. Why do we need a K-NN Algorithm?

Suppose there are two categories, Category A and Category B, and we have a new data point
x1. In which of these categories does this data point lie? To solve this type of problem, we
need a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of
a particular data point. Consider the below diagram:

42. How does K-NN work?

The working of K-NN can be explained on the basis of the below algorithm:
• Step-1: Select the number K of neighbors.
• Step-2: Calculate the Euclidean distance from the new data point to each point in the
  training data.
• Step-3: Take the K nearest neighbors as per the calculated Euclidean distances.
• Step-4: Among these K neighbors, count the number of data points in each category.
• Step-5: Assign the new data point to the category for which the number of neighbors is
  maximum.
• Step-6: Our model is ready.


Suppose we have a new data point and we need to put it in the required category. Consider
the below image:

• Firstly, we will choose the number of neighbors; here we choose k=5.
• Next, we will calculate the Euclidean distance between the data points. The Euclidean
  distance is the distance between two points, which we have already studied in geometry.
  For two points (x1, y1) and (x2, y2) it can be calculated as:
  d = √((x2 − x1)² + (y2 − y1)²)

• By calculating the Euclidean distance we get the nearest neighbors: three nearest
  neighbors in category A and two nearest neighbors in category B. Consider the below
  image:

• As 3 of the 5 nearest neighbors are from category A, this new data point is assigned to
  category A.
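The steps of the K-NN procedure can be sketched in plain Python as follows. The training points, labels and query points are made up so that, for k = 5, three of the nearest neighbors fall in category A and two in category B, mirroring the example; math.dist requires Python 3.8 or later.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=5):
    """Classify `query` by majority vote among its k nearest training points."""
    # Step-2: Euclidean distance from the query to every training point.
    distances = sorted(
        (math.dist(p, query), label)
        for p, label in zip(train_points, train_labels)
    )
    # Step-3 and Step-4: keep the k nearest and count labels per category.
    votes = Counter(label for _, label in distances[:k])
    # Step-5: the category with the most neighbors wins.
    return votes.most_common(1)[0][0]

# Hypothetical training data: four points in category A, four in category B.
train_points = [(2, 2), (2, 3), (3, 2), (0, 0), (4, 4), (4, 3), (7, 7), (8, 6)]
train_labels = ["A", "A", "A", "A", "B", "B", "B", "B"]

print(knn_predict(train_points, train_labels, (3, 3), k=5))   # A
print(knn_predict(train_points, train_labels, (7, 6), k=5))   # B
```

Note that there is no separate training step: the function only stores and searches the data, which is why K-NN is called a lazy learner.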


43. What is the difference between KNN and K means?

• K-NN is a supervised learning algorithm, while K-means is an unsupervised learning
  algorithm.
• K-NN is a classification or regression machine learning algorithm, while K-means is a
  clustering machine learning algorithm.
• K-NN is a lazy learner, while K-means is an eager learner. An eager learner has a model
  fitting step, that is, a training step, but a lazy learner does not have a training phase.
• K-NN performs much better if all of the features have the same scale; since K-means also
  relies on distances, it benefits from feature scaling as well.
• K-means is a clustering algorithm that tries to partition a set of points into K sets
  (clusters) such that the points in each cluster tend to be near each other. It is
  unsupervised because the points have no external classification.
• K-nearest neighbors is a classification (or regression) algorithm that, in order to
  determine the classification of a point, combines the classifications of the K nearest
  points. It is supervised because you are trying to classify a point based on the known
  classification of other points.

44. What is K-means used for?

K-Means Clustering is an unsupervised learning algorithm that is used to solve clustering
problems in machine learning or data science. This answer covers what the K-means clustering
algorithm is and what it is used for.
K-Means Clustering groups the unlabeled dataset into different clusters. Here K defines the
number of pre-defined clusters that need to be created in the process: if K=2, there will be
two clusters, for K=3 there will be three clusters, and so on.
It is an iterative algorithm that divides the unlabeled dataset into K different clusters in
such a way that each data point belongs to only one group of points with similar properties.
It allows us to cluster the data into different groups and is a convenient way to discover
the categories of groups in the unlabeled dataset on its own, without the need for any
labeled training data.
It is a centroid-based algorithm, where each cluster is associated with a centroid. The main
aim of this algorithm is to minimize the sum of distances between the data points and the
centroids of their corresponding clusters.
The algorithm takes the unlabeled dataset as input, divides the dataset into K clusters, and
repeats the process until it finds the best clusters, that is, until the cluster assignments
stop changing. The value of K should be predetermined in this algorithm.
The k-means clustering algorithm mainly performs two tasks:
• Determines the best positions for the K center points or centroids by an iterative
  process.
• Assigns each data point to its closest k-center. The data points which are near a
  particular k-center form a cluster.
Hence each cluster has data points with some commonalities, and it is away from other
clusters. The below diagram explains the working of the K-means Clustering Algorithm:

45. How does K-means work?

The working of the K-means algorithm is explained in the steps below:

• Step-1: Choose the number K of clusters.
• Step-2: Select K random points as the initial centroids. (They need not be points from the
input dataset.)
• Step-3: Assign each data point to its closest centroid, which forms the K clusters.
• Step-4: Compute the mean of the data points in each cluster and place that cluster's new
centroid there.
• Step-5: Repeat the third step, i.e. reassign each data point to the new closest centroid.
• Step-6: If any reassignment occurred, go to Step-4; otherwise go to FINISH.
• Step-7: The model is ready.
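The steps above can be sketched in a few lines of plain Python. This is only a minimal illustration, not a reference implementation: the function name kmeans, the sample points and the choice of drawing the initial centroids from the data points are assumptions made for the example.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster 2-D points into k groups; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Step 2: pick k distinct data points as the initial centroids (an assumption;
    # any k starting points could be used).
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Steps 3 and 5: assign every point to its closest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Step 6: stop when no point changed cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 4: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(
                    sum(coord) / len(members) for coord in zip(*members)
                )
    return centroids, labels


# Hypothetical sample data: two visually separate groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(labels)
print(centroids)
```

With this data the first three points end up in one cluster and the last three in the other; which cluster is numbered 0 depends on the random start.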
Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two variables is
given below:

• Let's take the number of clusters K=2, i.e. we will try to group the data points into two
different clusters.
• We need to choose K random points as centroids to form the clusters. These can be points
from the dataset or any other points. Here we select the two points below as centroids,
which are not part of our dataset. Consider the image below:

• Now we assign each data point of the scatter plot to its closest centroid. Closeness is
computed with the usual formula for the distance between two points. Equivalently, we draw
the perpendicular bisector of the line joining the two centroids. Consider the image below:

• From the above image, it is clear that the points to the left of the line are nearer to
K1, the blue centroid, and the points to the right of the line are nearer to the yellow
centroid. Let's color them blue and yellow for clear visualization.

• As we need to find the closest clusters, we repeat the process with new centroids. To
choose the new centroids, we compute the center of gravity (mean) of the data points in each
cluster and place the new centroids there, as below:

• Next, we reassign each data point to the new centroids. For this, we repeat the same
process of drawing the dividing line. The line will look like the image below:

From the above image, we can see that one yellow point is on the left side of the line and
two blue points are to the right of the line. So these three points are reassigned to the
other centroids.

As reassignment has taken place, we go back to Step-4, which is finding new centroids.
• We repeat the process by finding the center of gravity of each cluster, so the new
centroids will be as shown in the image below:

• With the new centroids, we again draw the dividing line and reassign the data points. The
image will be:

• We can see in the above image that there are no dissimilar data points on either side of
the line, which means our model is formed. Consider the image below:

As our model is ready, we can now remove the assumed centroids, and the two final clusters
will be as shown in the image below:

46. Is K nearest neighbor supervised or unsupervised?

KNN is a simple supervised learning algorithm.

KNN works on the basic assumption that data points of the same class lie close to each other.
Suppose you have a classification problem in which a given data point belongs to either
class A or class B, and your aim is to classify test data points into these classes using a
training dataset of already classified data points.
For a test data point, KNN assigns a probability (weight) of 1/k to each of the k nearest
pre-classified training points and a probability of 0 to all other training points, where k
may be any positive integer (an odd value is preferred to avoid ties). We then count how many
of those k points belong to each class (A and B), and the test data point is assigned to the
class with the larger count.
For example, with k=5 the algorithm looks for the 5 training points nearest to the test point
and counts the classes among them. If the count is A=3 and B=2, KNN assigns class A to the
test point.
In short, KNN is categorized under supervised ML techniques. It assumes that data points of
similar classes are near one another. To decide whether a data point fits class A or class B,
we need a training set of classified data points. KNN gives a probability of 1/k to each of
the k closest training points and zero probability to the others.
k could be any number; we count the classes (A, B) among the k points, and the test data
point is classified as the class with the higher count. If k = 3, the algorithm searches for
the 3 nearest points in the dataset and counts the instances of every class among them. If
the count turns out to be A = 2, B = 1, then KNN allots class A to the data point.
Because data points are classified by their proximity to data points of known class, KNN is
a supervised ML algorithm.
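The voting rule described above can be sketched as follows. This is a minimal illustration in plain Python under stated assumptions: the function name knn_classify and the training points are hypothetical and were chosen so that the k=5 vote comes out as A=3, B=2, matching the example in the text.

```python
import math
from collections import Counter


def knn_classify(train, query, k=5):
    """train is a list of ((x, y), label) pairs; returns (winning label, vote counts)."""
    # Sort the labelled points by distance to the query and keep the k nearest.
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    # Each neighbour casts one vote (weight 1/k); the most common label wins.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0], dict(votes)


# Hypothetical pre-classified training data.
train = [
    ((1.0, 1.0), "A"), ((0.5, 0.5), "A"), ((0.9, 1.4), "A"), ((2.0, 1.5), "A"),
    ((3.0, 3.2), "B"), ((3.5, 2.9), "B"), ((2.6, 2.4), "B"), ((4.0, 4.0), "B"),
]
label, votes = knn_classify(train, query=(1.9, 1.9), k=5)
print(label, votes)  # A {'A': 3, 'B': 2}
```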

47. What are advantages and limitations of KNN and K means?

The k-nearest neighbors (KNN) algorithm is a simple, supervised machine learning algorithm
that can be used to solve both classification and regression problems. It is easy to
implement and understand, but has the major drawback of becoming significantly slower as the
size of the data in use grows. KNN works by finding the distances between a query and all the
examples in the data, selecting the specified number (K) of examples closest to the query,
and then voting for the most frequent label (in the case of classification) or averaging the
labels (in the case of regression). In both classification and regression, the right K for
the data is chosen by trying several values of K and picking the one that works best.
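For the regression case mentioned above, the only change is that the labels of the K nearest examples are averaged instead of voted on. A minimal sketch, assuming one numeric input per example; the function name knn_regress and the data are hypothetical.

```python
def knn_regress(train, query, k=3):
    """train is a list of (x, y) pairs; returns the mean y of the k examples nearest to query."""
    # Find the k examples whose x value is closest to the query.
    nearest = sorted(train, key=lambda item: abs(item[0] - query))[:k]
    # Average their labels to obtain the prediction.
    return sum(y for _, y in nearest) / k


# Hypothetical training data: (input, numeric label).
train = [(1.0, 10.0), (2.0, 12.0), (3.0, 15.0), (6.0, 30.0), (7.0, 33.0)]
prediction = knn_regress(train, query=2.5, k=3)
print(prediction)  # mean of 12, 15 and 10
```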

Advantages
• The algorithm is simple and easy to implement.
• There is no need to build a model, tune several parameters, or make additional
assumptions.
• The algorithm is versatile. It can be used for classification, regression, and search.
Disadvantages
• The algorithm gets significantly slower as the number of examples and/or predictors
(independent variables) increases.

Mathematics based questions

48. How to choose right value for K in KNN?

To select the K that is right for your data, we run the KNN algorithm several times with
different values of K and choose the K that reduces the number of errors we encounter while
maintaining the algorithm's ability to make accurate predictions on data it has not seen
before. Here are some things to keep in mind:
• As we decrease the value of K to 1, our predictions become less stable. Imagine K=1 and a
query point surrounded by several reds and one green, where the green is the single nearest
neighbor. Reasonably, we would think the query point is most likely red, but because K=1,
KNN incorrectly predicts that the query point is green.
• Conversely, as we increase the value of K, our predictions become more stable due to
majority voting / averaging, and thus more likely to be accurate (up to a certain point).
Eventually, we begin to see an increasing number of errors. At that point we know we have
pushed the value of K too far.
• In cases where we take a majority vote (e.g. picking the mode in a classification problem)
among labels, we usually make K an odd number to have a tiebreaker.
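One way to "run KNN several times with different values of K" is to count the errors made on held-out points for each candidate K. The sketch below uses leave-one-out evaluation, which is an assumption (the text does not name a specific validation scheme); the function names and the red/green data are hypothetical.

```python
import math
from collections import Counter


def predict(train, query, k):
    """Majority label among the k training points nearest to query."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def leave_one_out_errors(data, k):
    """Count how many points are misclassified when each is held out in turn."""
    errors = 0
    for i, (point, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        if predict(rest, point, k) != label:
            errors += 1
    return errors


# Hypothetical data: a red group containing one stray green point, and a green group.
data = [
    ((1.0, 1.0), "red"), ((1.5, 1.2), "red"), ((1.2, 1.8), "red"),
    ((2.0, 1.0), "red"), ((1.6, 1.6), "green"),
    ((5.0, 5.0), "green"), ((5.5, 4.8), "green"), ((4.8, 5.6), "green"),
    ((5.2, 5.9), "green"),
]
# Odd values of K avoid ties in a two-class vote.
errors_by_k = {k: leave_one_out_errors(data, k) for k in (1, 3, 5, 7)}
best_k = min(errors_by_k, key=errors_by_k.get)
print(errors_by_k, best_k)
```

The K with the fewest held-out errors is a reasonable starting choice; on real data the same loop would normally run over a separate validation set or cross-validation folds.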

49. How to choose the value of "K number of clusters" in K-means Clustering?

The performance of the K-means clustering algorithm depends on how well-formed the resulting
clusters are, and choosing the optimal number of clusters is a difficult task. There are
several ways to find the optimal number of clusters; here we discuss a commonly used method
for choosing the value of K:
Elbow Method
The Elbow method is one of the most popular ways to find the optimal number of clusters.
This method uses the concept of the WCSS value. WCSS stands for Within Cluster Sum of
Squares, which measures the total variation within the clusters. The formula to calculate
the value of WCSS (for 3 clusters) is given below:
$$\mathrm{WCSS} = \sum_{P_i \in \mathrm{Cluster}_1} \mathrm{distance}(P_i, C_1)^2 + \sum_{P_i \in \mathrm{Cluster}_2} \mathrm{distance}(P_i, C_2)^2 + \sum_{P_i \in \mathrm{Cluster}_3} \mathrm{distance}(P_i, C_3)^2$$
In the above formula of WCSS, the term $\sum_{P_i \in \mathrm{Cluster}_1} \mathrm{distance}(P_i, C_1)^2$
is the sum of the squares of the distances between each data point in Cluster 1 and its
centroid $C_1$, and likewise for the other two terms.
To measure the distance between data points and centroid, we can use any method such as
Euclidean distance or Manhattan distance.
To find the optimal number of clusters, the elbow method follows the steps below:
• It executes K-means clustering on the given dataset for different values of K (typically
from 1 to 10).
• For each value of K, it calculates the WCSS value.
• It plots a curve of the calculated WCSS values against the number of clusters K.
• The point where the curve bends sharply, like the elbow of an arm, is taken as the best
value of K.
Since the graph shows a sharp bend that looks like an elbow, the method is known as the
elbow method. The graph for the elbow method looks like the image below:

Note: The number of clusters can be at most equal to the number of data points. If we choose
the number of clusters equal to the number of data points, the value of WCSS becomes zero,
and that will be the endpoint of the plot.
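The elbow procedure can be sketched on one-dimensional data as below. This is a rough illustration only: kmeans_1d is a hypothetical helper with a deterministic, evenly spaced initialisation (an assumption made so the run is repeatable), and the values are made up to contain three obvious groups.

```python
def kmeans_1d(values, k, max_iter=100):
    """Very small 1-D k-means; returns (centroids, clusters)."""
    data = sorted(values)
    # Assumed initialisation: k evenly spaced points of the sorted data.
    centroids = [data[(2 * j + 1) * len(data) // (2 * k)] for j in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assign each value to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for v in data:
            nearest = min(range(k), key=lambda j: abs(v - centroids[j]))
            clusters[nearest].append(v)
        # Move each centroid to the mean of its cluster.
        new_centroids = [
            sum(c) / len(c) if c else centroids[j] for j, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


def wcss(centroids, clusters):
    """Within-cluster sum of squared distances to each cluster's centroid."""
    return sum((v - c) ** 2 for c, cluster in zip(centroids, clusters) for v in cluster)


# Hypothetical data with three visible groups.
values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9, 9.0, 9.4, 8.8]
wcss_by_k = {k: wcss(*kmeans_1d(values, k)) for k in range(1, len(values) + 1)}
for k, w in wcss_by_k.items():
    print(k, round(w, 3))
```

Plotting wcss_by_k against K would show a steep drop up to K=3 and a nearly flat curve afterwards, with WCSS reaching zero when K equals the number of data points.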

Feel free to contact me on +91-8329347107 calling / +91-9922369797 whatsapp,


email ID: [email protected] and [email protected]

*********************

Artificial Intelligence & Machine Learning
Course Code: 302049

Unit 4: Development of ML Model


Third Year Bachelor of Engineering (Choice Based Credit System)
Mechanical Engineering (2019 Course)
Board of Studies – Mechanical and Automobile Engineering, SPPU, Pune
(With Effect from Academic Year 2021-22)

Question bank and its solution


by

Abhishek D. Patange, Ph.D.


Department of Mechanical Engineering
College of Engineering Pune (COEP)
QUESTION BANK FOR UNIT 4: DEVELOPMENT OF ML MODEL

Unit 4: DEVELOPMENT OF ML MODEL


Syllabus:
Content Theory Mathematics Numerical
• Development of ML model
Problem identification
Data Collection, Data pre-processing, Model
Selection (C & R)
Model training, Model evaluation, Hyper
parameter Tuning, Predictions (C & R)
• Development of ML models to solve mechanical engineering problems
Note: ‘C’ stands for classification and ‘R’ stands for regression

Type of question and marks:


Type  | Theory            | Mathematics | Numerical
Marks | 2 or 4 or 6 marks | 4 marks     | 2 or 4 marks

Theory Mathematics Numerical


Topic: Problem identification

Theory questions

1. What are four typical problems to be solved using machine learning approach?

• Regression
If the value to be predicted is continuous, the task is a regression problem in machine
learning. Example: given the area name, size of land, etc. as features, predict the expected
cost of the land.
• Classification
If the value to be predicted is a category or discrete value such as yes/no or
positive/negative, the task is a classification problem in machine learning. Example: given
a sentence, predict whether it is a negative or a positive review.
• Clustering
Grouping a set of points into a given number of clusters. Example: given the points 3, 4, 8,
9 and the number of clusters 2, the ML system might divide the set into cluster 1 = {3, 4}
and cluster 2 = {8, 9}.
• Ranking
Used for constructing a ranker from a set of labelled examples. The example set consists of
instance groups that can be scored with a given criterion. The ranking labels are
{ 0, 1, 2, 3, 4 } for each instance. The ranker is trained to rank new instance groups with
unknown scores for each instance.

2. Represent classification, regression, and clustering on two-dimensional plane pictorially.

• Regression describes the relationship between two "things": one dependent variable related
to one independent variable, or a group of dependent variables against a group of
independent variables, as well as 1:n combinations of variables (called multivariate
regression). Regression "sees" relationships.
• Classification and clustering both arrange items into groups according to some criteria,
for example grouping by gender, age, or some preferences. The difference is easy to see: in
classification you define the factors that differentiate the population and put each
individual in a specific "drawer" according to your grouping criteria.
• The criterion can be single (one unique factor of differentiation) or a combination of
factors (e.g. gender and birth city). You are classifying the sample.
• In clustering, the process senses differences based on the values of the variables and
assigns a specific "drawer" to each case of the sample. After this separation based on math
and the values of the variables, the researcher tries to validate whether the grouping has a
humanly logical reason and, if possible, to characterize it in human terms.
• This is tricky because the whole process is essentially numerical: the math cannot "read"
the label of each factor and assign human meaning.
• Let's see an example: suppose you measure customer satisfaction using a group of variables
(price, quality, service, delivery time, gender, age, education, …) that is supposed to
segment the sample into groups based on the values of those variables.
• You can define how many groups you want, or some tools show it graphically so that the
researcher can decide what configuration (number of groups, called clusters) would be
convenient. Regression measures relationships; in classification you define the grouping
criteria and separate by them.

• In clustering, the groups are defined by the "distance" among sample cases, and at the end
the researcher looks for some meaning in that grouping. Regression, classification and
clustering are all based on the sample content, and the result reflects that sample.
• Extrapolating the conclusion to the population requires validation according to the level
of confidence you are willing to assume, and additional mathematical processing needs to be
done.

3. Why is 'clustering' not 'classification'? Give example.

• Usually classification refers to the problem of observing data and deciding to which class
each item belongs (e.g. cat or dog).
• Typically the classes are known in advance; they are something we care about.
• Clustering typically refers to the case where we have data but no labels associated with
each item (i.e. no predefined notion of cats and dogs).
• In this case the problem is to cluster the items into groups so that the items in each
group "resemble each other".
• A particular clustering algorithm may happen to group pet images into cats and dogs, but it
may also end up clustering them into other groups (based on colour, pose, size or any other
characteristic).
• The key is that clustering is typically done in the unsupervised domain, that is, there
are no predefined classes known in advance.
Examples of classification and clustering are as follows.
Classification
• Email classification: spam or non-spam
• Sanctioning a loan to a customer: yes if the customer is capable of paying the EMI for the
sanctioned loan amount, no otherwise
• Cancer tumour cell identification: is it critical or non-critical?
• Sentiment analysis of tweets: is the tweet positive, negative or neutral?
• Classification of news: classify the news into one of the predefined classes (politics,
sports, health, etc.)
Clustering
• Marketing: discover customer segments for marketing purposes
• Biology: group different species of plants and animals
• Libraries: cluster different books on the basis of topics and information
• Insurance: group customers and their policies, and identify frauds
• City planning: make groups of houses and study their values based on their geographical
locations and other factors
• Earthquake studies: identify dangerous zones

4. Differentiate between clustering and classification.

Parameter | Classification | Clustering
Type | Used for supervised learning | Used for unsupervised learning
Basic | Process of classifying the input instances based on their corresponding class labels | Grouping the instances based on their similarity without the help of class labels
Need | It has labels, so there is need of training and testing dataset for verifying the model created | There is no need of training and testing dataset
Complexity | More complex as compared to clustering | Less complex as compared to classification
Uses | It uses algorithms to categorize the new data as per the observations of the training set. | It uses statistical concepts in which the data set is divided into subsets with the same features.
Objective | Its objective is to find which class a new object belongs to from the set of predefined classes. | Its objective is to group a set of objects to find whether there is any relationship between them.
Example algorithms | Logistic regression, Naive Bayes classifier, Support vector machines, etc. | k-means clustering algorithm, Fuzzy c-means clustering algorithm, Gaussian (EM) clustering algorithm, etc.

5. Differentiate between regression and classification.

Regression | Classification
In Regression, the output variable must be of continuous nature or real value. | In Classification, the output variable must be a discrete value.
The task of the regression algorithm is to map the input value (x) with the continuous output variable (y). | The task of the classification algorithm is to map the input value (x) with the discrete output variable (y).
Regression Algorithms are used with continuous data. | Classification Algorithms are used with discrete data.
In Regression, we try to find the best fit line, which can predict the output more accurately. | In Classification, we try to find the decision boundary, which can divide the dataset into different classes.
Regression algorithms can be used to solve the regression problems such as Weather Prediction, House price prediction, etc. | Classification Algorithms can be used to solve classification problems such as Identification of spam emails, Speech Recognition, Identification of cancer cells, etc.
The regression Algorithm can be further divided into Linear and Non-linear Regression. | The Classification algorithms can be divided into Binary Classifier and Multi-class Classifier.
Method of evaluation: measurement of root mean square error | Method of evaluation: measuring accuracy
Nature of the predicted data: Ordered | Nature of the predicted data: Unordered

6. Explain terminology of understanding ML based classification and regression model.

• Algorithm = A method, function, or series of instructions used to generate a machine
learning model. Examples include linear regression, decision trees, support vector
machines, and neural networks.
• Attribute = A quality describing an observation (e.g. color, size, weight). In Excel terms,
these are column headers.
• Categorical Variables = Variables with a discrete set of possible values. Can be ordinal
(order matters) or nominal (order doesn't matter).

• Classification = Predicting a categorical output.
  • Binary classification = predicts one of two possible outcomes (e.g. is the email spam or
  not spam?)
  • Multi-class classification = predicts one of multiple possible outcomes (e.g. is this a
  photo of a cat, dog, horse or human?)
• Classification Threshold = The lowest probability value at which we're comfortable
asserting a positive classification. For example, if the predicted probability of being
diabetic is > 50%, return True, otherwise return False.
• Classifier = An algorithm that is used to map the input data to a specific category.
• Classification Model = A model that learns from the input data given for training and then
predicts the class or category of new data.
• Binary Classification = A type of classification with two outcomes, e.g. either true or
false.
• Multi-Class Classification = Classification with more than two classes; each sample is
assigned to one and only one label or target.
• Multi-label Classification = A type of classification where each sample is assigned to a
set of labels or targets.
• Clustering = Unsupervised grouping of data into buckets.
• Confusion Matrix = Table that describes the performance of a classification model by
grouping predictions into 4 categories.
  • True Positives: we correctly predicted they do have diabetes
  • True Negatives: we correctly predicted they don't have diabetes
  • False Positives: we incorrectly predicted they do have diabetes (Type I error)
  • False Negatives: we incorrectly predicted they don't have diabetes (Type II error)
• Continuous Variables = Variables with a range of possible values defined by a number
scale (e.g. sales, lifespan).
• Convergence = A state reached during the training of a model when the loss changes very
little between each iteration.
• Epoch = One complete pass of the algorithm over the entire training data set; the number
of epochs is the number of times the algorithm sees the entire data set.
• Feature = With respect to a dataset, a feature represents an attribute and value
combination. Color is an attribute. "Color is blue" is a feature. In Excel terms, features
are similar to cells. The term feature has other definitions in different contexts.
• Feature Selection = The process of selecting relevant features from a data set for
creating a machine learning model.

• Feature Vector = A list of features describing an observation with multiple attributes. In
Excel we call this a row.
• Hyperparameters = Higher-level properties of a model, such as how fast it can learn
(learning rate) or the complexity of the model. The depth of trees in a Decision Tree or the
number of hidden layers in a Neural Network are examples of hyperparameters.
• Instance = A data point, row, or sample in a dataset. Another term for observation.
• Label = The "answer" portion of an observation in supervised learning. For example, in a
dataset used to classify flowers into different species, the features might include the
petal length and petal width, while the label would be the flower's species.
• Model = A data structure that stores a representation of a dataset (weights and biases).
Models are created/learned when you train an algorithm on a dataset.
• Neural Networks = Mathematical algorithms modeled after the brain's architecture, designed
to recognize patterns and relationships in data.
• Normalization = Rescaling the values of features to a common range (e.g. 0 to 1) so that
no single feature dominates and computation is faster. (Restricting the values of weights
in regression to avoid overfitting is called regularization.)
• Noise = Any irrelevant information or randomness in a dataset which obscures the
underlying pattern.
• Observation = A data point, row, or sample in a dataset. Another term for instance.
• Outlier = An observation that deviates significantly from other observations in the
dataset.
• Overfitting = Overfitting occurs when your model learns the training data too well and
incorporates details and noise specific to your dataset. You can tell a model is overfitting
when it performs great on your training/validation set, but poorly on your test set (or new
real-world data).
• Regression = Predicting a continuous output (e.g. price, sales).
• Supervised Learning = Training a model using a labeled dataset.
• Test Set = A set of observations used at the end of model training and validation to
assess the predictive power of your model, i.e. how well it generalizes to unseen data.
• Training Set = A set of observations used to generate machine learning models.
• Underfitting = Underfitting occurs when your model over-generalizes and fails to
incorporate relevant variations in your data that would give your model more predictive
power. You can tell a model is underfitting when it performs poorly on both training and
test sets.

• Unsupervised Learning = Training a model to find patterns in an unlabeled dataset (e.g.
clustering).
• Validation Set = A set of observations used during model training to provide feedback on
how well the current parameters generalize beyond the training set. If training error
decreases but validation error increases, your model is likely overfitting and you should
pause training.
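The classification threshold and confusion matrix entries of the glossary above can be illustrated with a short sketch. The function name confusion_counts, the labels and the predicted probabilities are all hypothetical; the 0.5 threshold follows the diabetes example in the glossary.

```python
def confusion_counts(y_true, probabilities, threshold=0.5):
    """Return (TP, TN, FP, FN) for binary labels, predicting True when p > threshold."""
    tp = tn = fp = fn = 0
    for actual, p in zip(y_true, probabilities):
        predicted = p > threshold
        if predicted and actual:
            tp += 1  # correctly predicted positive
        elif not predicted and not actual:
            tn += 1  # correctly predicted negative
        elif predicted and not actual:
            fp += 1  # Type I error
        else:
            fn += 1  # Type II error
    return tp, tn, fp, fn


# Hypothetical data: True means the person actually has diabetes.
y_true = [True, True, True, False, False, False, False, True]
probabilities = [0.9, 0.7, 0.4, 0.2, 0.6, 0.1, 0.3, 0.8]
tp, tn, fp, fn = confusion_counts(y_true, probabilities)
accuracy = (tp + tn) / len(y_true)
print(tp, tn, fp, fn, accuracy)  # 3 3 1 1 0.75
```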

7. Enlist steps involved in development of classification model.

Following are the steps to be considered in the development of a classification model.

1 - Data Collection
• The quantity and quality of your data dictate how accurate your model is
• The outcome of this step is generally a representation of the data which we will use for
training
• Using experimental data, data generated by simulations, or pre-collected data (e.g.
datasets from Kaggle, UCI, etc.) still fits into this step
2 - Data Preparation
• Wrangle data and prepare it for training
• Clean whatever may require it (remove duplicates, correct errors, deal with missing
values, normalization, data type conversions, etc.)
• Randomize the data, which erases the effects of the particular order in which we collected
and/or otherwise prepared it
• Visualize the data to help detect relevant relationships between variables or class
imbalances (bias alert!), or perform other exploratory analysis
• Split into training and evaluation sets
3 - Choose a Model
• Different algorithms are for different tasks; choose the right one
4 - Train the Model
• The goal of training is to answer a question or make a prediction correctly as often as
possible
• Linear regression example: the algorithm would need to learn values for m (or W) and b in
y = mx + b (x is input, y is output)
• Each iteration of the process is a training step
5 - Evaluate the Model
• Uses some metric or combination of metrics to "measure" the objective performance of the
model
• Test the model against previously unseen data

• This unseen data is meant to be somewhat representative of model performance in the real
world, but it still helps tune the model (as opposed to test data, which does not)
• A good train/evaluation split is 80/20, 70/30, or similar, depending on domain, data
availability, dataset particulars, etc.
6 - Hyperparameter Tuning
• Hyperparameter tuning is an "art form" as opposed to a science
• Tune model hyperparameters for improved performance
• Simple model hyperparameters may include: number of training steps, learning rate,
initialization values and distribution, etc.
7 - Make Predictions
• Further (test set) data that have, until this point, been withheld from the model (and for
which class labels are known) are used to test the model; this gives a better approximation
of how the model will perform in the real world
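The "randomize" and "split" parts of the steps above can be sketched as follows. This is an illustrative helper only: the function name, the 70/15/15 proportions and the integer stand-in data are assumptions, not values prescribed by the text.

```python
import random


def train_validation_test_split(rows, train=0.7, validation=0.15, seed=42):
    """Shuffle a copy of rows, then cut it into training, validation and test lists."""
    shuffled = rows[:]
    # Randomize: erase any ordering left over from data collection.
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],                    # used to fit the model
        shuffled[n_train:n_train + n_val],     # used for evaluation and tuning
        shuffled[n_train + n_val:],            # withheld until the final test
    )


rows = list(range(100))  # stand-in for 100 labelled observations
train_set, validation_set, test_set = train_validation_test_split(rows)
print(len(train_set), len(validation_set), len(test_set))  # 70 15 15
```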

8. Enlist steps involved in development of regression model.

• Regression analysis is an analytical process whose end goal is to understand the
inter-relationships in the data and find as much useful information as possible.
• The process consists of a number of steps, which are loosely detailed below.
1 - Problem definition
• The very first step is, of course, to define the problem we are trying to solve. It may be
a business question that needs to be answered or simply a prediction we want to make based
on some set of data. At this stage we must know the target variable and the attributes we
presume affect the target variable. This presumption is analysed later to judge its
credibility. For the sake of our discussion let's take the Titanic dataset as an example.
• In this dataset we have data on about 900 passengers. The problem we must solve is
predicting which passengers likely survived the tragedy, given their data.

A look at the Titanic Dataset


• So now we know that 'Survival' is the response variable, but of the 10 attributes given
for each passenger, how do we determine which of these predictor variables affect the
result? That is where data analysis comes in.

2 - Analyse Data
The key is to have visual representations of our data so we can better understand the
'inter-relationships' of the variables; visual tools are highly recommended to make the EDA
(Exploratory Data Analysis) process easier. For the afore-mentioned dataset, we could try
answering a number of questions that might give us a better understanding of the problem at
hand. What is the survival rate of passengers from each class?

Graphs and charts

How about the survival rate based on gender?

How about the Correlation of all the attributes?

Heatmap showing correlation


Finding correlation is an important step as it allows us to roughly pick the attributes that
have a relation with the response variable. We are most likely to pick the
attributes/variables that show a strong correlation (positive or negative) with the target
variable. From this section we can deduce that plotting graphs is vital for the next step,
which is choosing a model. Graphs drawn before model fitting can include histograms,
boxplots, stem-and-leaf displays, scatter plots, etc.
3 - Model Selection
Based on the data, we are to pick a suitable model or regression equation. You may be
familiar with many such models like Linear Regression, Support Vector Machine, Random
Forest etc. The task in this step is to pick one that we assume will express the relationships of
our data in the best way possible. This assumption can be later accepted or refuted based on
analysis after fitting the model.
4 - Model Fitting
For simplicity's sake, let's consider linear regression, y = mx + c. We have the data and we
have a model. At this stage we train the model on the given dataset, but what of the
parameters of this equation? We must estimate these parameters when fitting the model,
and they can be optimised with many algorithms. Perhaps this is when terms like
'Gradient Descent' or 'Adam optimiser' ring a bell. The purpose of an optimiser is simply to
update the parameter values in every iteration of training so that we minimise the loss or
error. This is the part where our model learns to correct itself and provide a best-fitting
solution that would likely have high accuracy. For a simple model like linear regression, we
can use the Least Squares method to estimate the parameters 'm (slope)' and 'c (y-intercept)'
to get the best-fit line, i.e. the line that lies as close as possible to the data points
overall. The least squares method minimizes the sum of the squares of the errors. Because
squaring magnifies large errors, it works best when no outliers are present in the data.
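As a small worked illustration of the least squares idea, the sketch below uses the standard closed-form solution for a single input variable: the slope is the ratio of the co-variation of x and y to the variation of x, and the intercept makes the line pass through the point of means. The function name `least_squares_line` and the sample values are assumptions made for this example.

```python
def least_squares_line(xs, ys):
    """Return (m, c) minimising the sum of squared errors for y = m*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-variation of x and y, and variation of x, around their means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx              # slope
    c = mean_y - m * mean_x    # intercept: the line passes through (mean_x, mean_y)
    return m, c

# Hypothetical measurements lying roughly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]
m, c = least_squares_line(xs, ys)   # m is about 1.97, c is about 1.09
```

An iterative optimiser such as gradient descent should approach the same m and c for this model; the closed form simply gets there in one step.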
5 - Model evaluation
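The final step is model evaluation: measuring and critically assessing how well the model
fits the data points. We run the model on the test data and check how accurately it
was able to predict the output values. There are a number of measures to check this, as
discussed below:
i) We can find the RMSE (root mean squared error) between the actual Y values and the
predicted Y values. There are other variations of it that can be explored.

Formula for RMSE: $\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}$, where $y_i$ is the actual value, $\hat{y}_i$ is the predicted value and $N$ is the number of samples.


ii) We can calculate the R-squared value, which measures the goodness of fit (the proportion
of variance explained), typically within a range of 0 to 1, where the ideal value is 1.

Formula to find R-squared value: $R^2 = 1 - \frac{\sum_{i=1}^{N}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{N}(y_i - \bar{y})^2}$, where $\bar{y}$ is the mean of the actual values.


iii) We can perform cross-validation to assess which model among a few chosen ones performed
the best for our given problem.
iv) Finding the statistical significance of parameters. This involves stating a null
hypothesis, an alternative hypothesis and an alpha level (the accepted probability of wrongly
rejecting the null hypothesis). An example is the Chi-squared test, which tests whether there
is any relation between two categorical variables.

Formula for Chi-Square Statistical Test: $\chi^2 = \sum_{i}\frac{(O_i - E_i)^2}{E_i}$, where $O_i$ are the observed counts and $E_i$ are the counts expected if the variables were unrelated.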

There are many other methods, some more complex than others but these are usually a
good place to start. Based on this analysis, the model is updated and perfected after which it
can be used for its intended purpose.
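To make the Chi-squared test mentioned in point iv) concrete, here is a minimal sketch that computes the statistic for a contingency table. The counts are hypothetical, and the function name `chi_square_statistic` is an assumption for this example. Expected counts are derived from the row and column totals under the assumption that the two variables are unrelated.

```python
def chi_square_statistic(table):
    """Chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic

# Hypothetical counts: rows = two groups, columns = outcome yes / no.
observed = [[10, 20],
            [20, 10]]
chi2 = chi_square_statistic(observed)   # about 6.67
```

For a 2 x 2 table there is one degree of freedom, and the critical value at an alpha level of 0.05 is about 3.84, so a statistic of about 6.67 would lead us to reject the hypothesis that the two variables are unrelated.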

9. What are the sources of the data that is needed for training the classification model for identifying cutting tool state in milling/drilling/lathe?

10. What are the sources of the data that is needed for training the classification model for condition monitoring of bearings, gears, rotating elements?

11. What are the sources of the data that is needed for training the ML model for End-to-End Autonomous Driving With Scene Understanding?

12. What are the sources of the data that is needed for training the ML model for accurately predicting the device's position in three dimensions (3D motion sensing using Sensor Fusion)?

13. You have been given the task of developing a classification model for identifying cutting tool state in milling/drilling/lathe. So initially you will collect the data corresponding to the tool state, either healthy or faulty. What are the possible sources of the data needed for the said application?

Questions 9-12 can be rephrased as question 13, so the answer is the same.

14. What is training data? What is labeled data? What is unlabeled data? What are key steps involved in developing training data?

 Machine learning models are as good as the data they're trained on. Without
high-quality training data, even the most efficient machine learning algorithms
will fail to perform.
 The need for quality, accurate, complete, and relevant data starts early on in the
training process.
 Only if the algorithm is fed with good training data can it easily pick up the
features and find relationships that it needs to predict down the line.
 More precisely, quality training data is a more significant aspect of machine
learning (and artificial intelligence) than any other.
 If you introduce the machine learning (ML) algorithms to the right data, you're
setting them up for accuracy and success.
 Training data is the initial dataset used to train machine learning algorithms.
Models create and refine their rules using this data. It's a set of data samples used
to fit the parameters of a machine learning model, training it by example.
 Training data is also known as training dataset, learning set, and training set. It's
an essential component of every machine learning model and helps them make
accurate predictions or perform a desired task.
 Simply put, training data builds the machine learning model. It teaches what the
expected output looks like. The model analyzes the dataset repeatedly to deeply
understand its characteristics and adjust itself for better performance.
 In a broader sense, training data can be classified into two categories: labeled
data and unlabeled data.

 Labeled data is a group of data samples tagged with one or more meaningful
labels. It's also called annotated data, and its labels identify specific
characteristics, properties, classifications, or contained objects.
 For example, the images of fruits can be tagged as apples, bananas, or grapes.
 Labeled training data is used in supervised learning. It enables ML models to
learn the characteristics associated with specific labels, which can be used to
classify newer data points. In the example above, this means that a model can use
labeled image data to understand the features of specific fruits and use this
information to group new images.
 Data labeling or annotation is a time-consuming process as humans need to tag
or label the data points. Labeled data collection is challenging and expensive. It
isn't easy to store labeled data when compared to unlabeled data.
 As expected, unlabeled data is the opposite of labeled data. It's raw data or data
that's not tagged with any labels for identifying classifications, characteristics, or
properties. It's used in unsupervised machine learning, and the ML models have
to find patterns or similarities in the data to reach conclusions.
 Going back to the previous example of apples, bananas, and grapes, in unlabeled
training data, the images of those fruits won't be labeled. The model will have to
evaluate each image by looking at its characteristics, such as color and shape.
 After analyzing a considerable number of images, the model will be able to
differentiate new images (new data) into the fruit types of apples, bananas, or
grapes. Of course, the model wouldn't know that the particular fruit is called an
apple. Instead, it knows the characteristics needed to identify it.
 There are hybrid models that use a combination of supervised and unsupervised
machine learning.

15. How is training data used in machine learning?

 Unlike machine learning algorithms, traditional programming algorithms follow a set of
instructions to accept input data and provide output. They don't rely on historical data,
and every action they make is rule-based. This also means that they don't improve over
time, which isn't the case with machine learning.

 For machine learning models, historical data is fodder. Just as humans rely on past
experiences to make better decisions, ML models look at their training dataset with past
observations to make predictions.
 Predictions could include classifying images as in the case of image recognition, or
understanding the context of a sentence as in natural language processing (NLP).
 Think of a data scientist as a teacher, the machine learning algorithm as the student, and
the training dataset as the collection of all textbooks.
 The teacher's aspiration is that the student must perform well in exams and also in the
real world. In the case of ML algorithms, testing is like exams. The textbooks (training
dataset) contain several examples of the type of questions that'll be asked in the exam.
 Of course, they won't contain all the examples of questions that'll be asked in the exam,
nor will all the examples included in the textbooks be asked in the exam. The textbooks
can help prepare the student by teaching them what to expect and how to respond.
 No textbook can ever be fully complete. As time passes, the kind of questions asked will
change, and so, the information included in the textbooks needs to be changed. In the
case of ML algorithms, the training set should be periodically updated to include new
information.
 In short, training data is a textbook that helps data scientists give ML algorithms an idea
of what to expect. Although the training dataset doesn't contain all possible examples,
it'll make algorithms capable of making predictions.

16. What makes training data good?

High-quality data translates to accurate machine learning models. Low-quality data can
significantly affect the accuracy of models, which can lead to severe financial losses. It's
almost like giving a student a textbook containing wrong information and expecting them to
excel in the examination. The following are the four primary traits of quality training data.
 Relevant
The data needs to be relevant to the task at hand. For example, if you want to train
a computer vision algorithm for autonomous vehicles, you probably won't require images of
fruits and vegetables. Instead, you would need a training dataset containing photos of roads,
sidewalks, pedestrians, and vehicles.
 Representative
The AI training data must have the data points or features that the application is made to
predict or classify. Of course, the dataset can never be absolute, but it must have at least the
attributes the AI application is meant to recognize. For example, if the model is meant to
recognize faces within images, it must be fed with diverse data containing people's faces
from various ethnicities. This will reduce the problem of AI bias, and the model won't be
prejudiced against a particular race, gender, or age group.
 Uniform
All data should have the same attributes and must come from the same source. Suppose your
machine learning project aims to predict churn rate by looking at customer information. For
that, you'll have a customer information database that includes customer name, address,
number of orders, order frequency, and other relevant information. This is historical data and
can be used as training data. One part of the data can't have additional information, such as
age or gender. This will make training data incomplete and the model inaccurate. In short,
uniformity is a critical aspect of quality training data.
 Comprehensive
Again, the training data can never be absolute. But it should be a large dataset that
represents the majority of the model's use cases. The training data must have enough
examples that'll allow the model to learn appropriately. It must contain real-world data
samples as it will help train the model to understand what to expect. If you're thinking of
training data as values placed in large numbers of rows and columns, sorry, you're wrong. It
could be any data type like text, images, audio, or videos.

17. What affects training data quality?

Humans are highly social creatures, but there are some prejudices that we might have picked
up as children and that require constant conscious effort to get rid of. Although undesirable,
such biases may affect our creations, and machine learning applications are no different. For
ML models, training data is the only book they read. Their performance or accuracy will depend
on how comprehensive, relevant, and representative that book is. That being said, three
factors affect the quality of training data:
 People: The people who train the model have a significant impact on its accuracy or
performance. If they're biased, it'll naturally affect how they tag data and, ultimately,
how the ML model functions.
 Processes: The data labeling process must have tight quality control checks in place.
This will significantly increase the quality of training data.
 Tools: Incompatible or outdated tools can make data quality suffer. Using robust data
labeling software can reduce the cost and time associated with the process.

18. How much training data is enough?

 There isn't a specific answer to how much training data is enough training data. It
depends on the algorithm you're training – its expected outcome, application,
complexity, and many other factors.

 Suppose you want to train a text classifier that categorizes sentences based on the
occurrence of the terms "cat" and "dog" and their synonyms such as "kitty," "kitten,"
"pussycat," "puppy," or "doggy". This might not require a large dataset as there are only
a few terms to match and sort.
 But, if this was an image classifier that categorized images as "cats" and "dogs," the
number of data points needed in the training dataset would shoot up significantly. In
short, many factors come into play to decide what training data is enough training data.
 The amount of data required will change depending on the algorithm used.
 For context, deep learning, a subset of machine learning, often requires millions of data
points to train the artificial neural networks (ANNs). In contrast, classical machine learning
algorithms may require only thousands of data points. But of course, this is a rough
generalization, as the amount of data needed varies depending on the application.
 In general, the more good-quality data the model is trained on, the more accurate it tends
to become. So it is usually better to have a large amount of data as training data.
Garbage in, garbage out
 The phrase "garbage in, garbage out" is one of the oldest and most used phrases in data
science. Even with the rate of data generation growing exponentially, it still holds true.
 The key is to feed high-quality, representative data to machine learning algorithms.
Doing so can significantly enhance the accuracy of models. Good quality training data is
also crucial for creating unbiased machine learning applications.

19. How should you split up a dataset into test and training sets?

 When we are working on model development, we need to both train the model and test it
using the data we have. Since it is challenging to obtain a vast amount of data while the
model is in the development phase, the most obvious answer is to split the data into two
separate sets, one for training and the other for testing.
A. Splitting the data set into a training set and test set:
Two conditions need to be taken care of before proceeding with the splitting of the
dataset:
 The test set needs to be large enough to give statistically meaningful results.
 The characteristics of the training set and test set should be similar.
Once the above two conditions are satisfied, the ultimate goal should be to
develop a model that generalizes well to new data.
B. Validation of the trained model over the test data.
The model should not train over the test data. Surprisingly good results on evaluation
metrics can be an indication that you are inadvertently training on test data.
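The split described in part A can be sketched as follows. This is a minimal hold-out split assuming a simple list of samples; the function name `holdout_split`, the 20% test fraction and the seed are illustrative choices, not values prescribed by the text. Shuffling before splitting is one simple way to keep the characteristics of the two sets similar.

```python
import random

def holdout_split(samples, test_fraction=0.2, seed=0):
    """Shuffle a copy of `samples` and split it into (train, test) lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)   # seeded so the split is repeatable
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(100))                      # stand-in for 100 labeled samples
train, test = holdout_split(data, test_fraction=0.2, seed=42)
```

No sample appears in both lists, so a model fitted on `train` never sees `test`. Libraries such as scikit-learn offer a ready-made equivalent (`train_test_split`), which can also stratify the split by class label.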

20. What is test data?

 A test set in machine learning is a secondary (or tertiary) data set that is used to test a
machine learning program after it has been trained on an initial training data set. The
idea is that predictive models always have some sort of unknown capacity that needs to
be tested out, as opposed to analysed from a programming perspective.
 A test set is also known as a test data set or test data.

21. What is validation data?

 In machine learning, a validation set is used to “tune the parameters” (more precisely,
the hyperparameters) of a classifier. The validation set is used to evaluate the model's
capability as these settings are varied, to see how it might perform in subsequent testing.
 The validation set is also known as a validation data set, development set or dev set.

22. Compare Training data vs. test data vs. validation data.

 Training data is used in model training, or in other words, it's the data used to fit the
model. On the contrary, test data is used to evaluate the performance or accuracy of the
model. It's a sample of data used to make an unbiased evaluation of the final model fit
on the training data.
 A training dataset is an initial dataset that teaches the ML models to identify desired
patterns or perform a particular task. A testing dataset is used to evaluate how effective
the training was or how accurate the model is.
 Once an ML algorithm is trained on a particular dataset and if you test it on the same
dataset, it's more likely to have high accuracy because the model knows what to expect.
If the training dataset contains all possible values the model might encounter in the
future, all well and good.
 But that's never the case. A training dataset can never be comprehensive and can't teach
everything that a model might encounter in the real world. Therefore a test dataset,
containing unseen data points, is used to evaluate the model's accuracy.
 Then there's validation data. This is a dataset used for frequent evaluation during the
training phase. Although the model sees this dataset occasionally, it doesn't learn from
it. The validation set is also referred to as the development set or dev set. It helps protect
models from overfitting and underfitting.
 Although validation data is separate from training data, data scientists might reserve a
part of the training data for validation. But of course, this automatically means that the
validation data was kept away during the training.
 Many use the terms "test data" and "validation data" interchangeably. The main
difference between the two is that validation data is used to validate the model during
the training, while the testing set is used to test the model after the training is
completed.
 The validation dataset gives the model the first taste of unseen data. However, not all
data scientists perform an initial check using validation data. They might skip this part
and go directly to testing data.

23. Explain with neat sketch K-fold cross-validation mode.

 The classifier model can be designed/trained and its performance evaluated in
K-fold cross-validation mode, training mode and test mode.
 The main idea behind K-fold cross-validation is that each sample in our dataset has
the opportunity of being tested. It is a special case of cross-validation where we
split the dataset into k parts (folds) and iterate over it k times. In each round,
one part is used for validation, and the remaining k-1 parts are merged into a
training subset for model training.
 Computation time is reduced, as the process is repeated only 10 times when the value
of k is 10. It also has reduced bias.
 Every data point gets to be tested exactly once and is used in training k-1 times.
 The variance of the resulting estimate is reduced as k increases.
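The fold rotation described above can be sketched with plain index lists. This is an illustrative generator, with the name `k_fold_indices` chosen for this example; it does not shuffle the data, which a real experiment would usually do first. With k = 5 it reproduces the 5-fold case asked about in the next question.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]             # the held-out part
        train = indices[:start] + indices[start + size:]     # the other k-1 parts
        yield train, validation
        start += size

folds = list(k_fold_indices(n_samples=10, k=5))
for train_idx, val_idx in folds:
    print("train:", train_idx, "validate:", val_idx)
```

Each index is validated exactly once and appears in the training part of the other k-1 rounds. The model's score is then typically averaged over the k rounds.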

24. Explain with neat sketch 5-fold cross-validation mode.

25. What is hyper parameter tuning?

• Machine learning algorithms have hyperparameters that allow you to tailor the
behavior of the algorithm to your specific dataset.
• Hyperparameters are different from parameters, which are the internal coefficients or
weights for a model found by the learning algorithm. Unlike parameters,
hyperparameters are specified by the practitioner when configuring the model.
• Typically, it is challenging to know what values to use for the hyperparameters of a given
algorithm on a given dataset, therefore it is common to use random or grid search
strategies for different hyperparameter values.
• The more hyperparameters of an algorithm that you need to tune, the slower the
tuning process. Therefore, it is desirable to select a minimum subset of model
hyperparameters to search or tune.
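The grid and random search strategies mentioned above can be sketched generically. In the sketch below, `toy_score` is a hypothetical stand-in for "train the model with these hyperparameters and return its validation score"; the grid values are likewise illustrative. Grid search tries every combination, so its cost is the product of the list lengths, which is why tuning fewer hyperparameters is faster.

```python
import itertools
import random

def grid_search(param_grid, score_fn):
    """Evaluate every combination in `param_grid`; return (best_params, best_score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def random_search(param_grid, score_fn, n_trials, seed=0):
    """Evaluate `n_trials` randomly drawn combinations instead of the full grid."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {name: rng.choice(values) for name, values in param_grid.items()}
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical stand-in for "train the model and return its validation score".
def toy_score(params):
    return -(params["max_depth"] - 4) ** 2 - (params["min_samples_leaf"] - 2) ** 2

grid = {"max_depth": [2, 4, 6, 8], "min_samples_leaf": [1, 2, 5]}
best_params, best_score = grid_search(grid, toy_score)          # 4 x 3 = 12 evaluations
rand_params, rand_score = random_search(grid, toy_score, n_trials=5, seed=1)
```

Random search evaluates only the requested number of combinations, so it may miss the best grid point but scales better when many hyperparameters are involved.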

26. Explain hyper parameter tuning for simple decision tree.

Max_Depth: The maximum depth of the tree. If this is not specified in the Decision Tree, the
nodes will be expanded until all leaf nodes are pure or until all leaf nodes contain fewer than
min_samples_split samples.
 Default = None
 Input options → integer
Min_Samples_Split: The minimum number of samples required to split an internal node. If the
number of samples in an internal node is less than min_samples_split, then that node will
become a leaf node.
 Default = 2
 Input options → integer or float (if float, then min_samples_split is a fraction)
Min_Samples_Leaf: The minimum number of samples required to be at a leaf node. Therefore, a
split can only happen if it leaves at least min_samples_leaf samples in both of the resulting
nodes.
 Default = 1
 Input options → integer or float (if float, then min_samples_leaf is a fraction)
Max_Features: The number of features to consider when looking for the best split. For
example, if there are 35 features in a dataframe and max_features is 9, only 9 of the 35
features will be considered at each split of the decision tree.
 Default = None
 Input options → integer, float (if float, then max_features is fraction) or {“auto”, “sqrt”,
“log2”}
 “auto”: max_features = sqrt(n_features)
 “sqrt”: max_features = sqrt(n_features)
 “log2”: max_features = log2(n_features)
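The input options listed for max_features can be read as a small translation rule from the setting to an actual number of features. The helper below, `resolve_max_features`, is a hypothetical illustration written for this answer and is not a library function; the rounding down and the floor of one feature are assumptions modelled on how scikit-learn documents the float option.

```python
import math

def resolve_max_features(max_features, n_features):
    """Translate a max_features setting into a number of features per split."""
    if max_features is None:                       # default: consider every feature
        return n_features
    if isinstance(max_features, str):
        if max_features in ("auto", "sqrt"):
            return max(1, int(math.sqrt(n_features)))
        if max_features == "log2":
            return max(1, int(math.log2(n_features)))
        raise ValueError(f"unknown max_features: {max_features!r}")
    if isinstance(max_features, float):            # float: fraction of the features
        return max(1, int(max_features * n_features))
    return int(max_features)                       # integer: used as given

# The 35-feature example from the text.
for setting in (None, 9, 0.5, "sqrt", "log2"):
    print(setting, "->", resolve_max_features(setting, 35))
```

For 35 features, both "sqrt" and "log2" resolve to 5 features per split under these assumptions, while the fraction 0.5 resolves to 17.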

27. Explain hyper parameter tuning for SVM.

 The first important hyperparameter is the choice of kernel, which controls the manner in
which the input variables are projected. There are many to choose from, but linear,
polynomial, and RBF are the most common, perhaps just linear and RBF in practice.
 kernels in [‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’]
 If the polynomial kernel works out, then it is a good idea to dive into the degree
hyperparameter.
 Another critical parameter is the penalty (C) that can take on a range of values and has a
dramatic effect on the shape of the resulting regions for each class. A log scale might be
a good starting point.
 C in [100, 10, 1.0, 0.1, 0.001]

28. Explain hyper parameter tuning for ANN.

 Number of neurons: The number of units in each layer. Each neuron has weights (a weight is
the amplification of an input signal to the neuron) and a bias (an additive term), so more
neurons mean more learnable parameters.
 Activation function: Defines how a neuron or group of neurons activate ("spiking")
based on input connections and bias term(s).
 Learning rate: Step length for gradient descent update
 Batch size: Number of training examples in each gradient descent (gd) update.
 Epochs: The number of times all training examples have been passed through the
network during training.
 Loss function: Loss function specifies how to calculate the error between prediction and
label for a given training example. The error is backpropagated during training in order
to update learnable parameters.
 Number of layers: Typically the number of layers between the input and output layers, which
are called hidden layers.
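To show how the learning rate, batch size and number of epochs interact, the sketch below trains the simplest possible network, a single linear neuron, by mini-batch gradient descent on a mean squared error loss. The data, the function name `train_linear_neuron` and the specific hyperparameter values are assumptions chosen for illustration; a real ANN would add hidden layers and activation functions.

```python
def train_linear_neuron(xs, ys, learning_rate, batch_size, epochs):
    """Fit y = w*x + b by mini-batch gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n_updates = 0
    for _ in range(epochs):                            # one epoch = one full pass
        for start in range(0, len(xs), batch_size):    # one update per mini-batch
            batch_x = xs[start:start + batch_size]
            batch_y = ys[start:start + batch_size]
            errors = [(w * x + b) - y for x, y in zip(batch_x, batch_y)]
            # Gradients of the mean squared error with respect to w and b.
            grad_w = 2 * sum(e * x for e, x in zip(errors, batch_x)) / len(batch_x)
            grad_b = 2 * sum(errors) / len(batch_x)
            w -= learning_rate * grad_w                # step length = learning rate
            b -= learning_rate * grad_b
            n_updates += 1
    return w, b, n_updates

# Hypothetical noise-free data generated from y = 2x + 1.
xs = [i / 10 for i in range(10)]
ys = [2 * x + 1 for x in xs]
w, b, n_updates = train_linear_neuron(xs, ys, learning_rate=0.1, batch_size=2, epochs=2000)
```

With 10 samples and a batch size of 2 there are 5 updates per epoch, so 2000 epochs give 10000 updates in total. A larger learning rate takes bigger steps per update but can overshoot; a smaller one needs more epochs.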

Theory/Mathematics based questions/ Problems/Numerical

29. Explain different performance evaluators used for interpretation/assessment of classification model. Explain 2 x 2 confusion matrix and explain its terminology. Explain Cohen's Kappa, F Score, ROC Curve. Also, problems on calculating parameters required for model evaluation such as True positive, True negative, False positive, False negative etc.

CLASSIFICATION MODEL:
 Confusion matrix: A Confusion matrix is an N x N matrix used for evaluating the
performance of a classification model, where N is the number of target classes. The
matrix compares the actual target values with those predicted by the machine
learning model
 What can we learn from this matrix?
• There are two possible predicted classes: "yes" and "no". If we were predicting the
presence of a disease, for example, "yes" would mean they have the disease, and "no"
would mean they don't have the disease.
• The classifier made a total of 165 predictions (e.g., 165 patients were being tested for
the presence of that disease).
• Out of those 165 cases, the classifier predicted "yes" 110 times, and "no" 55 times.
• In reality, 105 patients in the sample have the disease, and 60 patients do not.

 True positives (TP): these are cases in which we predicted yes (they have the disease),
and they do have the disease.
 True negatives (TN): we predicted no, and they don't have the disease.
 False positives (FP): we predicted yes, but they don't actually have the disease. (Also
known as a "type I error.")
 False negatives (FN): we predicted no, but they actually do have the disease. (Also
known as a "type II error.")

 Accuracy: Overall, how often is the classifier correct? (TP+TN)/total = (100+50)/165 =
0.91
 Misclassification Rate: Overall, how often is it wrong? (FP+FN)/total = (10+5)/165 =
0.09 which is equivalent to 1 minus Accuracy
 True Positive Rate: When it's actually yes, how often does it predict yes? TP/actual yes
= 100/105 = 0.95 also known as "Sensitivity" or "Recall"
 False Positive Rate: When it's actually no, how often does it predict yes? FP/actual no =
10/60 = 0.17
 True Negative Rate: When it's actually no, how often does it predict no? TN/actual no =
50/60 = 0.83 which is equal to 1 minus False Positive Rate
 Precision: When it predicts yes, how often is it correct? TP/predicted yes = 100/110 =
0.91
 Cohen's Kappa: This is essentially a measure of how well the classifier performed as
compared to how well it would have performed simply by chance. In other words, a
model will have a high Kappa score if there is a big difference between the accuracy and
the null error rate (how often we would be wrong if we always predicted the majority class).
 F Score: This is the harmonic mean of the true positive rate (recall) and precision:
F = 2 × precision × recall / (precision + recall).
 ROC Curve: This is a commonly used graph that summarizes the performance of a
classifier over all possible thresholds. It is generated by plotting the True Positive Rate
(y-axis) against the False Positive Rate (x-axis) as you vary the threshold for assigning
observations to a given class.
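The worked figures above (TP = 100, TN = 50, FP = 10, FN = 5, total 165) can be checked with a short script. The function name `classification_metrics` is an assumption for this sketch; the Kappa and F score lines use the standard definitions, with the chance agreement computed from the row and column totals of the confusion matrix.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute common metrics from the four cells of a 2 x 2 confusion matrix."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)          # TP / predicted yes
    recall = tp / (tp + fn)             # TP / actual yes (true positive rate)
    false_positive_rate = fp / (fp + tn)
    f1 = 2 * precision * recall / (precision + recall)
    # Agreement expected by chance, from the predicted and actual totals.
    p_yes = ((tp + fp) / total) * ((tp + fn) / total)
    p_no = ((tn + fn) / total) * ((tn + fp) / total)
    expected = p_yes + p_no
    kappa = (accuracy - expected) / (1 - expected)
    return {
        "accuracy": accuracy,
        "misclassification_rate": 1 - accuracy,
        "precision": precision,
        "recall": recall,
        "false_positive_rate": false_positive_rate,
        "true_negative_rate": 1 - false_positive_rate,
        "f1": f1,
        "kappa": kappa,
    }

metrics = classification_metrics(tp=100, tn=50, fp=10, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

For this example the script reproduces the values quoted above (accuracy 0.91, recall 0.95, false positive rate 0.17, precision 0.91) and additionally gives an F score of about 0.93 and a Cohen's Kappa of 0.80.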

 REGRESSION MODEL:
 Mean Absolute Error (MAE)
The mean absolute error (MAE) is defined as

$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|y_i - \hat{y}_i\right|$

where $y_i$ is the actual value, $\hat{y}_i$ is the predicted value and $|y_i - \hat{y}_i|$ is the
absolute value of the difference between the actual and predicted value.
N is the number of sample points.
Let's dig into this a bit deeper to understand what this calculation represents.
Take a look at the following plot, which shows the number of failures for a piece of
machinery against the age of the machine:

In order to predict the number of failures from the age, we would want to fit a regression
line such as this:

In order to understand how well this line represents the actual data, we need to measure
how good a fit it is. We can do this by measuring the distance from the actual data points to
the line:

You may recall that these distances are called residuals or errors. The mean size of these
errors is the MAE. We can calculate it as follows:

The mean of the absolute errors (MAE) is 8.5. Why do we take the absolute value? To
remove the sign on the error value! If we don't, the positive and negative errors will tend to
cancel each other out, giving a misleadingly small value for our evaluation metric. If
mathematical symbols are not your strong point, you may not immediately see how this
calculation relates to the formula at the start of this chapter:

So here is how the table and formula relate:

Mean Absolute Error (MAE) tells us the average error in units of y, the predicted feature. A
value of 0 indicates a perfect fit, i.e. all our predictions are spot on.

The MAE has a big advantage in that the units of the MAE are the same as the units of y,
the feature we want to predict. In the example above, we have an MAE of 8.5, so it means
that on average our predictions of the number of machine failures are incorrect by 8.5
machine failures. This makes MAE very intuitive and the results are easily conveyed to a non-
machine learning expert!
 Root Mean Square Error (RMSE)
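Another evaluation metric for regression is the root mean square error (RMSE). Its
calculation is very similar to MAE, but instead of taking the absolute value to get rid of the
sign on the individual errors, we square the error (because the square of a negative number
is positive), average the squares and take the square root. The formula for RMSE is:

$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}$

where $y_i$ is the actual value, $\hat{y}_i$ is the predicted value and N is the number of sample points.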

Here is the calculation for RMSE on our example scenario:

As with MAE, we can think of RMSE as being measured in the y units. So the above error
can be read as an error of 9.9 machine failures on average per observation.
 MAE vs. RMSE
Compared to MAE, RMSE is never smaller, and the gap increases as individual errors
become larger. It penalizes a few large errors more than a lot of small errors. If you want
your model to avoid large errors, use RMSE over MAE.
Root Mean Square Error (RMSE) indicates the average error in units of y, the predicted
feature, but penalizes larger errors more severely than MAE. A value of 0 indicates a perfect
fit. You should also be aware that, as the sample size increases, the accumulation of slightly
higher RMSEs than MAEs means that the gap between these two measures can also increase.
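The difference between the two metrics is easiest to see on a small made-up example. In the sketch below the actual and predicted values are hypothetical (they are not the machine-failure table from the text); both sets of predictions have the same total absolute error, but in the second set that error is concentrated in a single point.

```python
import math

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean square error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

actual = [10, 20, 30, 40]
small_errors = [12, 18, 32, 38]      # every prediction is off by 2
one_large_error = [10, 20, 30, 48]   # one prediction is off by 8, the rest are exact

print(mae(actual, small_errors), rmse(actual, small_errors))        # 2.0 2.0
print(mae(actual, one_large_error), rmse(actual, one_large_error))  # 2.0 4.0
```

MAE cannot tell the two cases apart, whereas RMSE doubles when the same total error comes from one large miss. This is the sense in which RMSE penalizes large errors more severely.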
 R-Squared
As stated above, an advantage of both MAE and RMSE is that they can be thought of as
errors in the units of y, the predicted feature. This is helpful when relaying the results to
non-data scientists.
We can say things like "our model can predict the reliability of our machinery to within 8.5
machine failures on average" or "our model can predict the selling price of a house to within
£15k on average".
But take heed! This advantage can also be considered a disadvantage! It says nothing about
whether an error of 8.5 machine failures or an error of £15k on a house price is good or bad.
We can't compare how good different models are for different scenarios. This is where
R-squared, or R², comes in. Here is the formula for R²:
R² = 1 − Σ (yi − ŷi)² / Σ (yi − ȳ)²
where ŷi is the value predicted by the regression line and ȳ is the mean of the observed y values.
R² computes how much better the regression line fits the data than the mean line.
Another way to look at this formula is to compare the variance around the mean line to the
variance around the regression line:
R² = (var(mean) − var(line)) / var(mean)
Take our example above, predicting the number of machine failures. We can examine the
errors for our regression line as we did before. We can also compute a mean line (by taking
the mean y value) and examine the errors against this mean line. That is to say, we can see
the errors we would get if our model just predicted the mean number of failures (50.8) for
every age input. Here are the regression and mean lines, and their respective errors:


You can see that the regression line fits the data better than the mean line, which is what we
expected (the mean line is a pretty simplistic model, after all). But can you say how much
better it is? That's exactly what R² does! Here is the calculation.

Notice something? Most of this is the same as the calculation of RMSE. The additional parts
to the calculation are the column on the far right (in blue) and the final calculation row,
computing R². So we have an R-squared of 0.85. Without even worrying about the units of y
we can say this is a decent model. Why? Because the model explains 85% of the variation in
the data. That's exactly what an R-squared of 0.85 tells us!
R-squared (R²) tells us the degree to which the model explains the variance in the data. In
other words, how much better it is than just predicting the mean. Here's another example.
What if our data points and regression line looked like this?

The variance around the regression line is 0. In other words, var(line) is 0. There are no errors.
Now, remember that the formula for R-squared is:
R² = (var(mean) − var(line)) / var(mean)
So, with var(line) = 0, the above calculation for R-squared is:
R² = (var(mean) − 0) / var(mean) = 1

So, if we have a perfect regression line, with no errors, we get an R-squared of 1. Let's look at
another example. What if our data points and regression line looked like this, with the
regression line equal to the mean line?

In this case, var(line) and var(mean) are the same. So the above calculation will yield an
R-squared of 0:
R² = (var(mean) − var(mean)) / var(mean) = 0
So, if our regression line is only as good as the mean line, we get an R-squared of 0. What if
our regression line was really bad, worse than the mean line?

It's unlikely to get this bad! But if it does, var(mean)-var(line) will be negative, so R-squared
will be negative. An R-squared of 1 indicates a perfect fit. An R-squared of 0 indicates a
model no better or worse than the mean. An R-squared of less than 0 indicates a model
worse than just predicting the mean.
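The three cases above (perfect fit, a fit no better than the mean, and a fit worse than the
mean) can be reproduced with a few lines of Python. This is a sketch under stated assumptions:
the data are hypothetical, and var(mean) and var(line) are computed as sums of squared errors,
since the common 1/n factor cancels in the ratio.

```python
def r_squared(actual, predicted):
    """R^2 = (var(mean) - var(line)) / var(mean), using sums of squared errors."""
    mean_y = sum(actual) / len(actual)
    var_mean = sum((y - mean_y) ** 2 for y in actual)                 # errors against the mean line
    var_line = sum((y - p) ** 2 for y, p in zip(actual, predicted))  # errors against the model
    return (var_mean - var_line) / var_mean


# Hypothetical observations (not the machine-failure data from the text).
actual = [1.0, 2.0, 3.0, 4.0, 5.0]

perfect = r_squared(actual, actual)                       # model reproduces every point
mean_only = r_squared(actual, [3.0] * 5)                  # model always predicts the mean
worse = r_squared(actual, [5.0, 4.0, 3.0, 2.0, 1.0])      # model slopes the wrong way

print(perfect, mean_only, worse)
```

For these assumed values the three results are 1.0, 0.0, and -3.0, matching the three
interpretations of R-squared given above.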


30. Identify the methodology to attempt the following problems and list the general steps involved in it.

Methodology used for building regression models:

• To estimate remaining useful life of bearings /gears /cutting tool


• To guess dryness fraction in the steam generated by boiler
• To forecast material property
• To estimate engine emissions for remaining useful life
• To predict refrigerant two-phase pressure drop inside brazed plate heat exchangers
• To project enhanced state of charge of Lithium-ion Batteries used in EV

Methodology used for building clustering models:

• Discover product segments for marketing purposes


• Identify high temperature zones in process equipment.
• To recognize microstructure of material

Methodology used for building classification models:

• To diagnose condition of rotating machine element as healthy or faulty


• To identify quality of steam generated by boiler as wet, dry saturated or superheated
• To provide the correct quality of air-fuel mixture during the conditions such as
starting or idling or cruising
• To estimate wear state of Rolling Element Bearings



Artificial Intelligence & Machine Learning
Course Code: 302049

Unit 5: Reinforced and Deep Learning


Third Year Bachelor of Engineering (Choice Based Credit System)
Mechanical Engineering (2019 Course)
Board of Studies – Mechanical and Automobile Engineering, SPPU, Pune
(With Effect from Academic Year 2021-22)

Question bank and its solution


by

Abhishek D. Patange, Ph.D.


Department of Mechanical Engineering
College of Engineering Pune (COEP)
QUESTION BANK FOR UNIT 5: REINFORCED AND DEEP LEARNING

Unit 5: REINFORCED AND DEEP LEARNING


Syllabus:
Content Theory Mathematics Numerical
• Reinforced learning
Characteristics of reinforced learning
Algorithms: Value Based, Policy Based, Model
Based; Positive Vs. Negative Reinforced
Learning
Models: Markov Decision Process, Q Learning
• Deep Learning
Characteristics of Deep Learning
Artificial Neural Network
Convolution Neural Network
• Application of Reinforced and Deep Learning in Mechanical Engineering

Type of question and marks:


Type Theory Mathematics Numerical
Marks 2 or 4 or 6 marks 4 marks 2 or 4 marks



Topic: Characteristics of reinforced learning

Theory questions

1. What is reinforcement learning? State one practical example.

 Reinforcement Learning is a feedback-based machine learning technique in which an
agent learns to behave in an environment by performing actions and observing the
results of those actions. For each good action, the agent gets positive feedback, and for each
bad action, the agent gets negative feedback or a penalty.
 In Reinforcement Learning, the agent learns automatically from feedback without
any labeled data, unlike supervised learning.
 Since there is no labeled data, the agent is bound to learn from its experience only.
 RL solves a specific type of problem where decision making is sequential, and the goal is
long-term, such as game-playing, robotics, etc.
 The agent interacts with the environment and explores it by itself. The primary goal of an
agent in reinforcement learning is to improve the performance by getting the maximum
positive rewards.
 The agent learns by trial and error, and based on the experience, it learns
to perform the task in a better way. Hence, we can say that "Reinforcement learning is
a type of machine learning method where an intelligent agent (computer program)
interacts with the environment and learns to act within that." How a robotic dog
learns to move its limbs is an example of reinforcement learning.
 It is a core part of artificial intelligence, and many AI agents are built on the concept of
reinforcement learning. Here we do not need to pre-program the agent, as it learns from
its own experience without any human intervention.
 Example:
Suppose there is an AI agent in a maze environment, and its goal is to find
the diamond. The agent interacts with the environment by performing some actions, and
based on those actions, the state of the agent changes, and it also receives a reward
or penalty as feedback.
The agent keeps doing these three things (take an action, change state or remain in the
same state, and get feedback), and by doing so, it learns and explores the
environment.
The agent learns which actions lead to positive feedback (rewards) and which actions
lead to negative feedback (penalties). As a reward, the agent gets a positive point,
and as a penalty, it gets a negative point.
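The action / state / feedback cycle described in this example can be sketched as a short loop.
The code below is only an illustration under stated assumptions: the `Corridor` class, its
five-cell layout (fire pit at cell 0, diamond at cell 4), and the purely random agent are
hypothetical stand-ins for the maze, chosen to keep the example small.

```python
import random


class Corridor:
    """Hypothetical 1-D environment: fire pit at cell 0, diamond at cell 4."""

    def __init__(self):
        self.state = 2  # the agent starts in the middle cell

    def step(self, action):
        """Apply an action (-1 = left, +1 = right); return (state, reward, done)."""
        self.state += action
        if self.state == 4:
            return self.state, 1, True    # diamond: positive reward, episode ends
        if self.state == 0:
            return self.state, -1, True   # fire pit: penalty, episode ends
        return self.state, 0, False       # ordinary cell: no feedback yet


rng = random.Random(0)   # fixed seed so the run is repeatable
env = Corridor()
history = []             # (action, new state, reward) for every step
done = False
while not done:
    action = rng.choice([-1, 1])              # 1. take an action
    state, reward, done = env.step(action)    # 2. state changes, 3. feedback arrives
    history.append((action, state, reward))

print(history)
```

The agent here does not learn anything yet; the sketch only shows the interaction cycle that
a learning algorithm would build on.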


2. State key constituents of reinforcement learning. (Explain key terms in reinforcement learning.)

 Agent: An entity that can perceive/explore the environment and act upon it.
 Environment: The situation in which an agent is present or by which it is surrounded. In RL,
we assume a stochastic environment, which means it is random in nature.
 Action: Actions are the moves taken by an agent within the environment.
 State: State is the situation returned by the environment after each action taken by the
agent.
 Reward: Feedback returned to the agent from the environment to evaluate the action
of the agent.
 Policy: Policy is the strategy applied by the agent to choose the next action based on the
current state.
 Value: The expected long-term return with the discount factor applied, as opposed to the
short-term reward.
 Q-value: It is mostly similar to the value, but it takes one additional parameter, the
current action (a).
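The "Value" term above refers to a discounted sum of future rewards. As a minimal sketch (the
function name `discounted_return` and the reward sequence are illustrative assumptions, not
taken from the source), the return G = r0 + γ·r1 + γ²·r2 + ... can be computed as follows:

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward t steps ahead is weighted by gamma**t."""
    total = 0.0
    # Work backwards so each step is: reward now + gamma * (everything after).
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical episode: no reward for two steps, then +1 on the third step.
print(discounted_return([0, 0, 1], gamma=0.9))
```

With γ = 0.9, a reward of +1 that arrives two steps in the future is worth about 0.81 now,
which is why the value of a state differs from its immediate reward.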

3. State key features of reinforcement learning.

 In RL, the agent is not instructed about the environment or about which actions need to be
taken.
 It is based on a trial-and-error process.
 The agent takes the next action and changes state according to the feedback from the
previous action.
 The agent may get a delayed reward.
 The environment is stochastic, and the agent needs to explore it to obtain the
maximum positive rewards.

4. Explain approaches to implement reinforcement learning.

OR

Explain value-based, policy-based, and model-based reinforcement learning.

There are mainly three ways to implement reinforcement learning in ML, which are:
 Value-based: The value-based approach aims to find the optimal value function,
which is the maximum value achievable at a state under any policy. The agent therefore expects
the long-term return at any state s under policy π.
 Policy-based: The policy-based approach aims to find the optimal policy for the maximum
future rewards without using the value function. In this approach, the agent tries to apply
a policy such that the action performed at each step helps to maximize the future reward.
The policy-based approach has mainly two types of policy:
 Deterministic: For a given state, the policy (π) always produces the same action.
 Stochastic: In this policy, a probability distribution determines the produced action.
 Model-based: In the model-based approach, a virtual model of the environment is
created, and the agent explores that model to learn the environment. There is no particular
solution or algorithm for this approach because the model representation is different for
each environment.
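The difference between the two policy types can be illustrated with simple lookup tables. The
sketch below is an assumption-laden example: the state names, action names, and probabilities
are hypothetical, and a real policy may be a function or a neural network rather than a table.

```python
import random

# Deterministic policy: each state maps to exactly one action.
deterministic_policy = {"s1": "right", "s2": "right", "s3": "up"}

# Stochastic policy: each state maps to a probability distribution over actions.
stochastic_policy = {
    "s1": {"right": 0.8, "up": 0.2},
    "s2": {"right": 0.5, "up": 0.5},
    "s3": {"right": 0.1, "up": 0.9},
}


def act_deterministic(state):
    """Always returns the same action for the same state."""
    return deterministic_policy[state]


def act_stochastic(state, rng):
    """Samples an action according to the probabilities stored for the state."""
    actions = list(stochastic_policy[state])
    weights = [stochastic_policy[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


rng = random.Random(0)
print(act_deterministic("s1"))
print([act_stochastic("s1", rng) for _ in range(5)])
```

Calling `act_deterministic("s1")` repeatedly always gives the same answer, whereas repeated
calls to `act_stochastic("s1", rng)` can return different actions.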

5. Explain elements of reinforcement learning.

There are four main elements of Reinforcement Learning, which are given below:
Policy, Reward Signal, Value Function, Model of the environment
 Policy: A policy defines how an agent behaves at a given time. It maps
the perceived states of the environment to the actions taken in those states. A policy is
the core element of RL, as it alone can define the behavior of the agent. In some
cases, it may be a simple function or a lookup table, whereas in other cases, it may
involve general computation such as a search process. It could be a deterministic or a
stochastic policy:
For deterministic policy: a = π(s)
For stochastic policy: π(a | s) = P[A_t = a | S_t = s]
 Reward Signal: The goal of reinforcement learning is defined by the reward signal. At
each state, the environment sends an immediate signal to the learning agent, and this
signal is known as the reward signal. These rewards are given according to the good and
bad actions taken by the agent. The agent's main objective is to maximize the total
reward it receives. The reward signal can change the policy: for example, if
an action selected by the agent leads to a low reward, then the policy may change to
select other actions in the future.
 Value Function: The value function gives information about how good the situation and
action are and how much reward an agent can expect. A reward indicates the immediate
signal for each good and bad action, whereas a value function specifies the good
state and action for the future. The value function depends on the reward as, without
reward, there could be no value. The goal of estimating values is to achieve more
rewards.
 Model: The last element of reinforcement learning is the model, which mimics the
behaviour of the environment. With the help of the model, one can make inferences
about how the environment will behave. For example, if a state and an action are given, then
a model can predict the next state and reward. The model is used for planning, which
means it provides a way to choose a course of action by considering possible future situations
before actually experiencing them. Approaches that solve RL
problems with the help of a model are termed model-based approaches.
In contrast, an approach that does not use a model is called a model-free approach.
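The role of the model in planning can be sketched as a lookup table that predicts the next
state and reward for a (state, action) pair. Everything named below is a hypothetical
illustration: the `model` entries, the `value` estimates, and the one-step `plan` function are
assumptions, not part of the source.

```python
GAMMA = 0.9  # assumed discount factor

# Model: (state, action) -> (predicted next state, predicted reward).
model = {
    ("s3", "right"): ("s4", 1.0),   # stepping right reaches the goal
    ("s3", "left"): ("s2", 0.0),
    ("s3", "down"): ("s7", 0.0),
}

# Assumed value estimates for the states the model can lead to.
value = {"s2": 0.9, "s4": 0.0, "s7": 0.9}  # s4 is terminal, so no future value


def plan(state):
    """One-step lookahead: pick the action with the best predicted outcome."""
    scores = {
        action: reward + GAMMA * value[next_state]
        for (s, action), (next_state, reward) in model.items()
        if s == state
    }
    return max(scores, key=scores.get)


print(plan("s3"))
```

The agent never has to try the actions in the real environment here; it compares the outcomes
that the model predicts, which is the sense in which a model supports planning.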

6. How does Reinforcement Learning Work?

To understand the working process of the RL, we need to consider two main things:
 Environment: It can be anything such as a room, maze, football ground, etc.
 Agent: An intelligent agent such as AI robot.
Let's take an example of a maze environment that the agent needs to explore. Consider the
below image:

In the above image, the agent is at the very first block of the maze. The maze contains
an S6 block, which is a wall, an S8 block, which is a fire pit, and an S4 block, which holds the diamond.
The agent cannot cross the S6 block, as it is a solid wall. If the agent reaches the S4 block,
it gets a +1 reward; if it reaches the fire pit, it gets a -1 reward. It can take four
actions: move up, move down, move left, and move right.
The agent can take any path to reach the final point, but it needs to do so in as few
steps as possible. Suppose the agent follows the path S9-S5-S1-S2-S3; it will then get the
+1 reward.
The agent will try to remember the preceding steps that it has taken to reach the final step.
To memorize the steps, it assigns a value of 1 to each previous step. Consider the below step:

Now, the agent has successfully stored the previous steps by assigning the value 1 to each
previous block. But what will the agent do if it starts moving from a block that has a
block with value 1 on both sides? Consider the below diagram:

It will be difficult for the agent to decide whether it should go up or down, as each block
has the same value. So, the above approach is not suitable for the agent to reach the
destination. Hence, to solve the problem, we will use the Bellman equation, which is the
main concept behind reinforcement learning.


Theory/Mathematics based questions/ Problems/Numerical

7. What is the Bellman Equation? How is it helpful in RL?

The Bellman equation was introduced by the mathematician Richard Ernest Bellman in
1953, and hence it is called the Bellman equation. It is associated with dynamic
programming and is used to calculate the value of a decision problem at a certain point in
terms of the values of the states that follow it.
It is a way of calculating value functions in dynamic programming, and it underlies
modern reinforcement learning.
The key elements used in the Bellman equation are:
 The action performed by the agent is referred to as "a".
 The state reached by performing the action is "s".
 The reward/feedback obtained for each good and bad action is "R".
 The discount factor is gamma, "γ".
The Bellman equation can be written as:
V(s) = max [R(s,a) + γV(s')]
Where,
V(s) = value calculated at a particular state.
R(s,a) = reward at a particular state s for performing an action a.
γ = discount factor.
V(s') = the value of the next state s', reached from s by taking action a.
In the above equation, we take the max over all possible actions because the agent always
tries to find the optimal solution.
So now, using the Bellman equation, we will find the value at each state of the given
environment. We will start from the block that is next to the target block.
For 1st block:
V(s3) = max [R(s,a) + γV(s')], here V(s') = 0 because there is no further state to move to.
V(s3) = max[R(s,a)] => V(s3) = max[1] => V(s3) = 1.
For 2nd block:
V(s2) = max [R(s,a) + γV(s')], here let γ = 0.9, V(s') = 1, and R(s,a) = 0, because there is no
reward at this state.
V(s2) = max[0.9(1)] => V(s2) = max[0.9] => V(s2) = 0.9
For 3rd block:
V(s1) = max [R(s,a) + γV(s')], here γ = 0.9, V(s') = 0.9, and R(s,a) = 0, because there is no
reward at this state either.
V(s1) = max[0.9(0.9)] => V(s1) = max[0.81] => V(s1) = 0.81
For 4th block:
V(s5) = max [R(s,a) + γV(s')], here γ = 0.9, V(s') = 0.81, and R(s,a) = 0, because there is no
reward at this state either.
V(s5) = max[0.9(0.81)] => V(s5) = max[0.729] => V(s5) ≈ 0.73
For 5th block:
V(s9) = max [R(s,a) + γV(s')], here γ = 0.9, V(s') = 0.73, and R(s,a) = 0, because there is no
reward at this state either.
V(s9) = max[0.9(0.73)] => V(s9) = max[0.657] => V(s9) ≈ 0.66
Consider the below image:
Consider the below image:

Now, we will move further to the 6th block, and here the agent may change the route because it
always tries to find the optimal path. Let's consider the block next to the fire pit.

Now, the agent has three options to move: if it moves to the blue box, it will feel a
bump; if it moves to the fire pit, it will get the -1 reward. But here we are taking only
positive rewards, so it will move upwards only. The values of all the blocks are
calculated using this formula. Consider the below image:
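The block-by-block calculation above can be automated by applying the Bellman equation
repeatedly to every block until the values stop changing (value iteration). The sketch below
rests on assumptions that the text does not state explicitly, because the maze image is not
reproduced here: a 3 x 4 grid numbered S1 to S12 row by row from the top-left, deterministic
moves, a bump into the wall or the boundary leaving the agent in place, and γ = 0.9.

```python
GAMMA = 0.9
ROWS, COLS = 3, 4                  # assumed layout: S1..S4 / S5..S8 / S9..S12
WALL, DIAMOND, FIRE = 6, 4, 8      # block numbers taken from the text
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right


def step(state, move):
    """Return (next_state, reward) for a deterministic move from `state`."""
    row, col = divmod(state - 1, COLS)
    new_row, new_col = row + move[0], col + move[1]
    if not (0 <= new_row < ROWS and 0 <= new_col < COLS):
        return state, 0.0                      # bump into the boundary
    target = new_row * COLS + new_col + 1
    if target == WALL:
        return state, 0.0                      # bump into the wall
    if target == DIAMOND:
        return target, 1.0
    if target == FIRE:
        return target, -1.0
    return target, 0.0


def value_iteration(sweeps=50):
    """Apply V(s) = max [R(s,a) + gamma * V(s')] to every block, repeatedly."""
    values = {s: 0.0 for s in range(1, ROWS * COLS + 1)}
    for _ in range(sweeps):
        for s in values:
            if s in (WALL, DIAMOND, FIRE):
                continue                       # no moves are made from these blocks
            outcomes = (step(s, m) for m in MOVES)
            values[s] = max(r + GAMMA * values[ns] for ns, r in outcomes)
    return values


values = value_iteration()
for s in (3, 2, 1, 5, 9):
    print(f"V(s{s}) = {values[s]:.2f}")
```

Under these assumptions the printed values are 1.00, 0.90, 0.81, 0.73, and 0.66, which agree
with the hand calculation for the path S9-S5-S1-S2-S3.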


8. Explain types of reinforcement learning: (Positive & Negative reinforcement)

Positive Reinforcement: Positive reinforcement means adding something to
increase the tendency that the expected behavior will occur again. It impacts the
behavior of the agent positively and increases the strength of the behavior. This type of
reinforcement can sustain changes for a long time, but too much positive reinforcement may
lead to an overload of states that can diminish the results.
Advantages are:
 Maximizes performance
 Sustains change for a long period of time
Disadvantage:
 Too much reinforcement can lead to an overload of states, which can diminish the
results
Negative Reinforcement: Negative reinforcement is the opposite of positive
reinforcement, as it increases the tendency that a specific behavior will occur again by
avoiding a negative condition. It can be more effective than positive reinforcement
depending on the situation and behavior, but it provides reinforcement only to meet the minimum
behavior.
Advantages are:
 Increases behavior
 Helps enforce a minimum standard of performance
Disadvantage:
 It only provides enough to meet the minimum behavior

9. How to represent the agent state?

We can represent the agent state using the Markov state, which contains all the required
information from the history. The state S_t is a Markov state if it satisfies the following condition:
P[S_{t+1} | S_t] = P[S_{t+1} | S_1, ..., S_t]
The Markov state follows the Markov property, which says that, given the present, the future
is independent of the past. RL works in fully observable
environments, where the agent can observe the environment and act to reach a new state. The
complete process is known as the Markov Decision Process, which is explained below:

10. Explain Markov Decision Process


Markov Decision Process, or MDP, is used to formalize reinforcement learning
problems. If the environment is completely observable, then its dynamics can be modeled as
a Markov process. In an MDP, the agent constantly interacts with the environment and
performs actions; at each action, the environment responds and generates a new state.

MDP is used to describe the environment for RL, and almost all RL problems can be
formalized using an MDP.
An MDP is a tuple of four elements (S, A, Pa, Ra):
 A finite set of states S
 A finite set of actions A
 Pa: the probability that action a taken in state S leads to state S'
 Ra: the reward received after transitioning from state S to state S' due to action a
MDP uses the Markov property, so to better understand the MDP, we need to learn about it.
Markov Property: It says that "If the agent is in the current state s1, performs an
action a1, and moves to the state s2, then the state transition from s1 to s2 depends only on
the current state and the action taken; it does not depend on past actions, rewards, or
states." In other words, as per the Markov property, the current state transition does
not depend on any past action or state. Hence, an MDP is an RL problem that satisfies the
Markov property. For example, in a chess game, the players only focus on the current state and
do not need to remember past actions or states.
Finite MDP: A finite MDP is one with finite states, finite rewards, and finite actions. In
RL, we consider only the finite MDP.
Markov Process:
A Markov process is a memoryless process with a sequence of random states S1, S2, ....., St that
satisfies the Markov property. A Markov process is also known as a Markov chain, which is a
tuple (S, P) of a state set S and a transition function P. These two components (S and P)
define the dynamics of the system.
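A Markov chain (S, P) can be written down directly as a set of states and a table of
transition probabilities. The sketch below is a hypothetical example: the three machine
conditions and all probabilities are assumed for illustration. Note that `next_state` looks
only at the current state, which is exactly the Markov property.

```python
import random

# S: hypothetical condition states of a machine element.
STATES = ["healthy", "worn", "failed"]

# P: transition probabilities P[current][next]; each row sums to 1.
P = {
    "healthy": {"healthy": 0.9, "worn": 0.1, "failed": 0.0},
    "worn": {"healthy": 0.0, "worn": 0.8, "failed": 0.2},
    "failed": {"healthy": 0.0, "worn": 0.0, "failed": 1.0},  # absorbing state
}


def next_state(state, rng):
    """Sample the next state using only the current state (memoryless)."""
    targets = list(P[state])
    weights = [P[state][t] for t in targets]
    return rng.choices(targets, weights=weights, k=1)[0]


def simulate(start, steps, rng):
    """Generate a sequence of states S1, S2, ..., St."""
    path = [start]
    for _ in range(steps):
        path.append(next_state(path[-1], rng))
    return path


path = simulate("healthy", 20, random.Random(1))
print(path)
```

An MDP extends this structure by letting the agent choose an action at every step, so that the
transition probabilities and rewards depend on the action as well as on the current state.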

11. Explain Q-Learning.

 Q-learning is an off-policy RL algorithm used for temporal difference
learning. Temporal difference learning methods work by comparing
temporally successive predictions.
 It learns the value function Q(s, a), which indicates how good it is to take action "a" in a
particular state "s".
 The below flowchart explains the working of Q-learning:

State Action Reward State action (SARSA):


 SARSA stands for State Action Reward State Action, and it is an on-policy temporal
difference learning method. An on-policy control method selects the action for each
state while learning, using a specific policy.
 The goal of SARSA is to calculate Qπ(s, a) for the currently selected policy π and
all state-action pairs (s, a).
 The main difference between the Q-learning and SARSA algorithms is that, unlike
Q-learning, SARSA does not use the maximum Q-value of the next state to update the
Q-value in the table; it uses the Q-value of the next action actually selected by the policy.
 In SARSA, the new action and reward are selected using the same policy that
determined the original action. SARSA is so named because it uses the quintuple
(s, a, r, s', a').
Where,
s: original state
a: original action
r: reward observed while following the states
s' and a': new state-action pair
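The difference between the two update rules can be shown side by side. The sketch below
assumes the standard temporal difference form of both updates, with a learning rate `alpha`
that the text does not mention; the tiny Q-table and its numbers are hypothetical.

```python
def q_learning_update(q, s, a, r, s_next, alpha, gamma):
    """Off-policy: the target uses the best Q-value available in the next state."""
    target = r + gamma * max(q[s_next].values())
    q[s][a] += alpha * (target - q[s][a])


def sarsa_update(q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy: the target uses the Q-value of the action actually chosen next."""
    target = r + gamma * q[s_next][a_next]
    q[s][a] += alpha * (target - q[s][a])


def fresh_table():
    """Hypothetical Q-table with two states and two actions."""
    return {
        "s": {"left": 0.0, "right": 0.0},
        "s_next": {"left": 1.0, "right": 2.0},
    }


q_off = fresh_table()
q_learning_update(q_off, "s", "right", 0.0, "s_next", alpha=0.5, gamma=0.9)

q_on = fresh_table()
# Suppose the policy picked "left" in the next state, which is not the best action.
sarsa_update(q_on, "s", "right", 0.0, "s_next", "left", alpha=0.5, gamma=0.9)

print(q_off["s"]["right"], q_on["s"]["right"])
```

Starting from the same table, Q-learning moves Q(s, right) to about 0.9 because it looks at
the best next action, while SARSA moves it to about 0.45 because it follows the quintuple
(s, a, r, s', a') that was actually experienced.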
Deep Q Neural Network (DQN):
 As the name suggests, DQN is Q-learning using neural networks.
 For an environment with a large state space, defining and updating a Q-table is a
challenging and complex task.
 To solve this issue, we can use the DQN algorithm, in which, instead of defining a
Q-table, a neural network approximates the Q-values for each action and state.
 Now, we will expand on Q-learning.
Q-Learning Explanation:
 Q-learning is a popular model-free reinforcement learning algorithm based on the
Bellman equation.
 The main objective of Q-learning is to learn a policy that tells the agent
which actions to take under which circumstances to maximize the reward.
 It is an off-policy RL algorithm that attempts to find the best action to take in the current state.
 The goal of the agent in Q-learning is to maximize the value of Q.
 The Q-value can be derived from the Bellman equation. Consider the Bellman
equation given below:
V(s) = max [R(s,a) + γ Σs' P(s,a,s') V(s')]
where the sum runs over the possible next states s', and P(s,a,s') is the probability of
reaching s' from s by taking action a.
 In the equation, we have various components, including the reward, the discount factor (γ),
the probability, and the end states s'. But no Q-value is given yet, so first consider the
below image:

 In the above image, we can see an agent that has three value options: V(s1),
V(s2), and V(s3). As this is an MDP, the agent only cares about the current state and the future state.
The agent can go in any direction (up, left, or right), so it needs to decide where to go
for the optimal path. Here the agent will move on a probability basis and change
state. But if we want exact moves, we need to make some changes
in terms of the Q-value. Consider the below image:

 Q represents the quality of the actions at each state. So instead of using a value at each
state, we will use a pair of state and action, i.e., Q(s, a). The Q-value specifies which
action is more lucrative than the others, and according to the best Q-value, the agent takes
its next move. The Bellman equation can be used for deriving the Q-value.
 To perform any action, the agent will get a reward R(s, a), and it will also end up in a
certain state, so the Q-value equation will be:
Q(s, a) = R(s, a) + γ Σs' P(s,a,s') V(s')
 Hence, we can say that V(s) = max [Q(s, a)]
Q(s, a) = R(s, a) + γ Σs' P(s,a,s') max [Q(s', a')]
 The above formula is used to estimate the Q-values in Q-learning.


What is 'Q' in Q-learning?
 The Q stands for quality in Q-learning, which means it specifies the quality of an action
taken by the agent.
Q-table: A Q-table or matrix is created while performing the Q-learning. The table follows
the state and action pair, i.e., [s, a], and initializes the values to zero. After each action, the
table is updated, and the q-values are stored within the table. The RL agent uses this Q-table
as a reference table to select the best action based on the q-values.
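The Q-table procedure described above can be sketched in a few lines. This is a minimal
illustration, not the flowchart from the source: the five-state corridor, the learning rate
`ALPHA`, the purely random behaviour policy, and the number of episodes are all assumptions
made to keep the example short.

```python
import random

N_STATES = 5              # hypothetical corridor: states 0..4, state 4 is the goal
ACTIONS = [0, 1]          # 0 = move left, 1 = move right
ALPHA, GAMMA = 0.5, 0.9   # assumed learning rate and discount factor


def step(state, action):
    """Deterministic move; reward +1 only when the goal state is reached."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def train(episodes=500, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]   # Q-table initialised to zero
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = rng.choice(ACTIONS)         # explore with random actions
            next_state, reward, done = step(state, action)
            # Q-learning update: the target uses the best Q-value of the next state.
            target = reward + GAMMA * max(q[next_state])
            q[state][action] += ALPHA * (target - q[state][action])
            state = next_state
    return q


q_table = train()
# Use the table as a reference: pick the action with the highest Q-value per state.
greedy_policy = [row.index(max(row)) for row in q_table[:-1]]
print(greedy_policy)
```

Although the agent behaves randomly while learning, the greedy policy read from the finished
Q-table moves right in every state, which illustrates the off-policy nature of Q-learning.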

12. Difference between Reinforcement Learning and Supervised Learning.

Reinforcement learning and supervised learning are both part of machine learning,
but the two types of learning differ substantially. RL agents interact with the
environment, explore it, take actions, and get rewarded, whereas supervised learning
algorithms learn from a labeled dataset and, on the basis of that training, predict the
output. The differences between RL and supervised learning are tabulated below:
Reinforcement Learning | Supervised Learning
RL works by interacting with the environment. | Supervised learning works on an existing dataset.
The RL algorithm works like the human brain works when making some decisions. | Supervised learning works like a human learning things under the supervision of a guide.
No labeled dataset is present. | A labeled dataset is present.
No previous training is provided to the learning agent. | Training is provided to the algorithm so that it can predict the output.
RL helps to take decisions sequentially. | In supervised learning, decisions are made when input is given.

13. Explain various practical applications of reinforcement learning.

Applications in self-driving cars


Various papers have proposed Deep Reinforcement Learning for autonomous driving. In
self-driving cars, there are various aspects to consider, such as speed limits at various places,
drivable zones, avoiding collisions — just to mention a few.
Some of the autonomous driving tasks where reinforcement learning could be applied
include trajectory optimization, motion planning, dynamic pathing, controller optimization,
and scenario-based learning policies for highways.
For example, parking can be achieved by learning automatic parking policies. Lane changing
can be achieved using Q-Learning while overtaking can be implemented by learning an
overtaking policy while avoiding collision and maintaining a steady speed thereafter.
AWS DeepRacer is an autonomous racing car that has been designed to test out RL in a
physical track. It uses cameras to visualize the runway and a reinforcement learning model to
control the throttle and direction.


Wayve.ai has successfully applied reinforcement learning to training a car on how to drive in
a day. They used a deep reinforcement learning algorithm to tackle the lane following task.
Their network architecture was a deep network with 4 convolutional layers and 3 fully
connected layers. The example below shows the lane following task. The image in the middle
represents the driver’s perspective.
Industry automation with Reinforcement Learning
In industry, reinforcement learning-based robots are used to perform various tasks. Apart
from the fact that these robots are more efficient than human beings, they can also perform
tasks that would be dangerous for people.
A great example is the use of AI agents by DeepMind to cool Google data centers. This led
to a 40% reduction in the energy spent on cooling. The centers are now fully controlled by the AI
system without the need for human intervention, although there is still supervision from
data center experts. The system works in the following way:
 Taking snapshots of data from the data centers every five minutes and feeding them to
deep neural networks
 Predicting how different combinations of actions will affect future energy consumption
 Identifying actions that will lead to minimal power consumption while maintaining a set
standard of safety criteria
 Sending these actions to the data center and implementing them
 Verifying the actions with the local control system
Reinforcement learning applications in engineering
On the engineering front, Facebook has developed an open-source reinforcement
learning platform, Horizon. The platform uses reinforcement learning to optimize
large-scale production systems. Facebook has used Horizon internally to:
 personalize suggestions
 deliver more meaningful notifications to users
 optimize video streaming quality.
Horizon also contains workflows for:
 simulated environments
 a distributed platform for data preprocessing
 training and exporting models in production.
A classic example of reinforcement learning in video display is serving a user a low or high
bit rate video based on the state of the video buffers and estimates from other machine
learning systems. Horizon is capable of handling production-like concerns such as
deploying at scale, feature normalization, distributed learning, and serving and handling
datasets with high-dimensional data and thousands of feature types.

Reinforcement Learning in robotics manipulation


The use of deep learning and reinforcement learning can train robots that have the ability to
grasp various objects, even those unseen during training. This can, for example, be used in
building products on an assembly line. This is achieved by combining large-scale distributed
optimization and a variant of deep Q-Learning called QT-Opt. QT-Opt's support for
continuous action spaces makes it suitable for robotics problems. A model is first trained
offline and then deployed and fine-tuned on the real robot. Google AI applied this approach
to robotic grasping, where 7 real-world robots ran for 800 robot hours over a 4-month
period. In this experiment, the QT-Opt approach succeeded in 96% of the grasp attempts
across 700 trial grasps on objects that were previously unseen. Google AI's previous
method had a 78% success rate. Refer to the following links for demonstrations:
https://www.youtube.com/watch?v=W4joe3zzglU
https://aws.amazon.com/fr/deepracer/
RL can be used in large environments in the following situations:
1. A model of the environment is known, but an analytic solution is not available;
2. Only a simulation model of the environment is given (the subject of simulation-based
optimization)
3. The only way to collect information about the environment is to interact with it.

14. What is deep learning?

Deep learning is a branch of machine learning based on artificial neural networks. Because
neural networks are loosely modelled on the human brain, deep learning can be seen as a
loose imitation of how the brain processes information. In deep learning, we do not need to
explicitly program every rule; the model learns from data. The concept of deep learning is
not new; it has been around for decades. It has become popular recently because earlier we
did not have enough processing power or enough data. As processing power and the amount
of available data grew sharply over the last 20 years, deep learning became practical.
A formal definition of deep learning is: deep learning is a particular kind of machine
learning that achieves great power and flexibility by learning to represent the world as a
nested hierarchy of concepts, with each concept defined in relation to simpler concepts, and
more abstract representations computed in terms of less abstract ones.
The human brain contains approximately 100 billion neurons, and each neuron is connected
to thousands of its neighbours. The question here is how we recreate these neurons in a
computer. We create an artificial structure called an artificial neural net, made of nodes or
neurons. Some neurons take the input values, some produce the output values, and in
between there may be many interconnected neurons in the hidden layers.


• DL is a subfield of machine learning, inspired by the biological neurons of the brain and
translated into artificial neural networks with representation learning.
• When the volume of data increases, classical machine learning techniques, no matter how
optimized, tend to plateau in performance and accuracy, whereas deep learning usually
keeps improving in such cases.
• One cannot quantify an exact threshold for data to be called big, but intuitively a million
samples might be enough to call it big.

15. State difference between Machine Learning and Deep Learning.

Machine Learning | Deep Learning
Works well with a small dataset. | Needs a large dataset.
Can run on a low-end machine. | Depends heavily on a high-end machine.
Divides the task into sub-tasks, solves them individually, and finally combines the results. | Solves the problem end to end.
Takes less time to train. | Takes longer to train.
Testing time may increase. | Takes less time to test the data.

16. State different architectures of DL network.

• Deep Neural Network – A neural network with a certain level of complexity (multiple
hidden layers between the input and output layers). Such networks are capable of
modeling and processing non-linear relationships.
• Deep Belief Network (DBN) – A class of deep neural network made up of multiple layers
of belief networks. Steps for training a DBN:
a. Learn a layer of features from the visible units using the Contrastive Divergence algorithm.
b. Treat the activations of the previously trained features as visible units and then learn
features of features.
c. Finally, the whole DBN is trained when the learning for the final hidden layer is
achieved.
• Recurrent Neural Network (performs the same task for every element of a sequence) –
Allows for parallel and sequential computation. Similar to the human brain (a large
feedback network of connected neurons). Such networks are able to remember important
things about the input they received, which enables them to be more precise.

17. Explain Artificial Neural Network (ANN).

An artificial neural network (ANN) is a model built from layers of interconnected nodes
(artificial neurons). The term is often used loosely as another name for a deep neural
network or deep learning, although strictly an ANN is "deep" only when it has several
hidden layers.

What does a Neural Network mean?
• A neural network essentially takes logistic regression and repeats it multiple times.
• In ordinary logistic regression, we have an input layer and an output layer.
• In a neural network, there is at least one hidden layer of such units between the input
and output layers.

How many layers are needed to call it a “Deep” neural network?


• There is no specific number of layers that classifies a neural network as deep.
• The term "deep" is relative to the problem.
• The better question to ask is "How deep?".
• For example, "How deep is your swimming pool?" can be answered in multiple ways.
• It could be 2 meters deep or 10 meters deep, but it has "depth". The same holds for a
neural network: it can have 2 hidden layers or thousands of hidden layers.
• So we stick with the question "How deep?" for the time being.
Why are they called hidden layers?
• They are called hidden because their values are not observed directly: the training set
specifies only the inputs and the desired outputs, not what the layers in between should
compute.
• For example, suppose you have a NN with an input layer, one hidden layer, and an
output layer.
• When asked how many layers your NN has, the answer should be "It has 2 layers",
because by convention the input layer is not counted.
• The following describes what a 2-layer neural network looks like.
Step by step we shall understand this image.
1) Here we have a 2-layer artificial neural network. A neural network was created to mimic
the biological neurons of the human brain. In our ANN the hidden layer has "k" nodes. The
number of nodes is a hyperparameter, which means that it is chosen by the practitioner
building the model.
2) The input and output layers do not change. We have "n" input features and 3 possible
outcomes.
3) Unlike logistic regression, which uses the sigmoid function, this network uses the tanh
function as the activation function of its hidden layer. The reason is that the mean of tanh's
output is closer to 0, which makes the data better centered as input to the next layer. Like
sigmoid, tanh is non-linear, which is what lets the model learn more than a linear mapping.

4) In normal logistic regression: Input => Output.

Whereas in a neural network: Input => Hidden Layer => Output. The hidden layer can be
imagined as the output of part 1 and the input of part 2 of our ANN.
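
The 2-layer network described in the steps above can be sketched as a single forward pass in NumPy. This is only an illustration: the sizes (n = 4 input features, k = 5 hidden nodes), the random weights, the function name `forward`, and the softmax output are assumptions, since the source only states that there are n inputs, k hidden nodes, a tanh hidden layer, and 3 possible outcomes.

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    """Input -> tanh hidden layer -> softmax output (softmax is an assumption)."""
    hidden = np.tanh(W1 @ x + b1)                # hidden activations, shape (k,)
    scores = W2 @ hidden + b2                    # raw output scores, shape (3,)
    exp_scores = np.exp(scores - scores.max())   # subtract max for numerical stability
    return hidden, exp_scores / exp_scores.sum()

rng = np.random.default_rng(0)
n, k, n_classes = 4, 5, 3                        # illustrative sizes (assumed)
W1, b1 = rng.normal(size=(k, n)), np.zeros(k)
W2, b2 = rng.normal(size=(n_classes, k)), np.zeros(n_classes)

x = rng.normal(size=n)                           # one input example
hidden, probs = forward(x, W1, b1, W2, b2)
print(hidden.shape, probs.shape, probs.sum())
```

The hidden vector is exactly "the output of part 1 and the input of part 2": it is produced by the first weight matrix and consumed by the second.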

18. Explain elements of Deep Learning?

Researchers tried to mimic the working of the human brain and replicate it in machines,
making machines capable of solving complex problems. Deep Learning (DL) is a subset of
Machine Learning (ML) that allows us to train a model using a set of inputs and then predict
the output based on those inputs. Like the human brain, the model consists of a set of
neurons that can be grouped into 3 kinds of layers:
a) Input Layer: It receives the input and passes it to the hidden layers.
b) Hidden Layers: There can be 1 or more hidden layers in a Deep Neural Network (DNN).
"Deep" in DL refers to having more than 1 hidden layer. Most of the computation is done by
the hidden layers.
c) Output Layer: This layer receives input from the last hidden layer and gives the output.

19. Explain working of deep Learning with an example.

We will see how a DNN works with the help of the train price prediction problem. For
simplicity, we take 3 inputs, namely Departure Station, Arrival Station, and Departure
Date. In this case, the input layer has 3 neurons, one for each input. The first hidden
layer receives input from the input layer and performs mathematical computations,
followed by the other hidden layers. The number of hidden layers and the number of
neurons in each hidden layer are hyperparameters, and choosing them is a challenging task.
The output layer gives the predicted price. There can be more than 1 neuron in the
output layer; in our case, there is only 1 neuron because the output is a single value.


Now, how do the hidden layers make the price prediction? How is computation done inside
the hidden layers? This is explained with the help of activation functions, loss functions,
and optimizers.

20. What are activation functions?

Each neuron has an activation function that performs a computation. Different layers can
have different activation functions, but neurons belonging to one layer usually share the
same activation function. In a DNN, a weighted sum of the inputs is calculated from the
weights and the inputs provided. The activation function is then applied to this weighted
sum and converts it into the neuron's output.
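
The computation of one neuron (weighted sum followed by activation) can be sketched as below. The helper name `neuron_output` and the numeric values are hypothetical, and sigmoid is chosen only as an example activation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(inputs, weights, bias, activation):
    """Weighted sum of the inputs, then the activation function."""
    z = np.dot(weights, inputs) + bias   # weighted sum
    return activation(z)

inputs = np.array([1.0, 2.0, 3.0])
weights = np.array([0.5, -0.25, 0.0])    # illustrative values (assumed)
bias = 0.0

out = neuron_output(inputs, weights, bias, sigmoid)
print(out)  # weighted sum is 0.5 - 0.5 + 0.0 = 0, and sigmoid(0) = 0.5
```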

21. Why are activation functions required?

Activation functions help the model learn the complex relationships that exist within the
dataset. If we do not use an activation function in the neurons and give the weighted sum as
output, computations become difficult because the weighted sum has no bounded range.
Bounded activation functions such as sigmoid and tanh keep the output in a particular range.
Secondly, a non-linear activation function is preferred because it adds non-linearity to the
model, which otherwise would reduce to a simple linear regression model incapable of
benefiting from hidden layers. The ReLU function or one of its variants is mostly used for
hidden layers, and the sigmoid or softmax function is mostly used for the final layer in
binary or multi-class classification problems, respectively.


22. What is Loss/ Cost Function?

To train the model, we give the input (departure location, arrival location, and departure
date in the case of train price prediction) to the network and let it predict the output
using the activation functions. Then we compare the predicted output with the actual output
and compute the error between the two values using the loss/ cost function. The same
process is repeated for the entire training dataset, which gives the average loss/error.
The objective is to minimize this loss to make the model accurate. Each connection between
2 neurons carries a weight. Initially, the weights are randomly initialized, and the aim is
to update them with every iteration to reach the minimum value of the loss/ cost function.
We could change the weights randomly, but that is not an efficient method. This is the role
of optimizers, which update the weights automatically.

23. What are different loss functions and their use case?

The loss function is chosen based on the problem.

a. Regression problem
Mean squared error (MSE) is used where a real-valued quantity is to be predicted. MSE is
used in the train price prediction example because the predicted price is a real-valued
quantity.
b. Binary/ multi-class classification problem
Cross-entropy is used.
c. Maximum-margin classification
Hinge loss is used.
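
The three loss functions listed above can be written in a few lines each. This is a minimal sketch: the function names, the binary form of cross-entropy, the clipping constant `eps`, and the {-1, +1} label convention for hinge loss are assumptions made for the example.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error for regression."""
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Cross-entropy for labels in {0, 1} and predicted probabilities."""
    p = np.clip(p_pred, eps, 1 - eps)        # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def hinge(y_true, scores):
    """Hinge loss for labels in {-1, +1} and raw classifier scores."""
    return np.mean(np.maximum(0.0, 1.0 - y_true * scores))

mse_value = mse(np.array([3.0, 5.0]), np.array([2.0, 7.0]))                  # (1 + 4) / 2 = 2.5
bce_value = binary_cross_entropy(np.array([1.0, 0.0]), np.array([0.5, 0.5])) # ln 2
hinge_value = hinge(np.array([1.0, -1.0]), np.array([2.0, 0.5]))             # (0 + 1.5) / 2 = 0.75
print(mse_value, bce_value, hinge_value)
```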

24. Explain optimizers. Why are optimizers required?

Once the loss for one iteration is computed, an optimizer is used to update the weights.
Instead of changing weights manually, optimizers update the weights automatically in small
increments and help to find the minimum value of the loss/ cost function. Finding the
minimum of the cost function requires iterating through the dataset many times and thus
requires large computational power. The most common technique used to update these
weights is gradient descent.

25. What is Gradient Descent (GD) and its variants?

It is used to find the minimum value of the loss function by updating the weights. There are
3 variants:
a) Batch/ Vanilla Gradient Descent
• The gradient for the entire dataset is computed to perform one weight update.
• It gives good results but can be slow and requires large memory.
b) Stochastic Gradient Descent (SGD)
• Weights are updated for each training data point.
• Updates are therefore frequent, which can cause the objective function to fluctuate.
c) Mini-batch Gradient Descent
• It takes the best of batch gradient descent and SGD.
• It is the algorithm of choice.
• It reduces the variance of the updates compared with SGD, which can lead to more
stable convergence.
• Choosing a proper learning rate is still difficult.
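
The three variants differ only in how many samples feed each weight update, which the sketch below exposes as a `batch_size` argument. The linear model, the MSE loss, the synthetic noiseless data, and all names and hyperparameter values are assumptions chosen for illustration.

```python
import numpy as np

def gradient_descent(X, y, lr=0.05, epochs=300, batch_size=None, seed=0):
    """Fit y ~ X @ w with MSE loss.

    batch_size=None -> batch GD, batch_size=1 -> SGD, otherwise mini-batch GD.
    """
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    batch_size = n_samples if batch_size is None else batch_size
    w = np.zeros(n_features)
    for _ in range(epochs):
        order = rng.permutation(n_samples)             # shuffle once per epoch
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]
            error = X[idx] @ w - y[idx]
            grad = 2.0 * X[idx].T @ error / len(idx)   # gradient of the batch MSE
            w -= lr * grad                             # one weight update
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(64, 2))
true_w = np.array([2.0, -3.0])
y = X @ true_w                                         # noiseless synthetic targets

w_batch = gradient_descent(X, y)                       # 1 update per epoch
w_sgd = gradient_descent(X, y, batch_size=1)           # 64 updates per epoch
w_mini = gradient_descent(X, y, batch_size=16)         # 4 updates per epoch
print(w_batch, w_sgd, w_mini)
```

On this easy problem all three recover the same weights; the practical differences are memory per update and how noisy the path to the minimum is.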

26. What are GD optimization methods and which optimizer to use?

To overcome the challenges of plain GD, the AI community uses several optimization
methods. They also reduce the effort required for hyperparameter tuning.
a) Adagrad
• Algorithm of choice for sparse data.
• Eliminates the need to manually tune the learning rate, unlike plain GD.
• A default learning rate of 0.01 is commonly used.
b) Adadelta
• Addresses Adagrad's monotonically decreasing learning rate.
• Does not require a default learning rate to be set.
c) RMSprop
• RMSprop and Adadelta were developed for the same purpose at around the same time.
• A learning rate of 0.001 is commonly used.
d) Adam
• Works well on most problems and is usually the algorithm of choice.
• Can be seen as a combination of RMSprop and momentum.
• AdaMax and Nadam are variants of Adam.
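
As an illustration of "RMSprop plus momentum", one Adam update can be sketched as below. The defaults (lr = 0.001, beta1 = 0.9, beta2 = 0.999, eps = 1e-8) are the commonly quoted ones; the function name `adam_step`, the toy objective f(w) = (w - 3)^2, and the larger learning rate used in the loop are assumptions made for the example.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step counter."""
    m = beta1 * m + (1 - beta1) * grad           # momentum: running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2      # RMSprop part: running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                 # bias correction for the zero initialisation
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter scaled step
    return w, m, v

# Toy problem (assumed): minimise f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w = np.array([0.0])
m, v = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 2001):
    grad = 2.0 * (w - 3.0)
    w, m, v = adam_step(w, grad, m, v, t, lr=0.05)
print(w)  # ends close to 3
```

Because the step is divided by the root of the running squared gradient, the very first update has a size of about `lr` regardless of how large the gradient is.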
To sum up, a DNN takes the input, which is transformed by weighted sums and activation
functions to learn the complex relationships within the dataset. Then the loss is computed
for the entire dataset from the actual and predicted values. Finally, to minimize the loss
and bring the predicted values close to the actual ones, the weights are updated using
optimizers. This process continues until the model converges, with the aim of reaching the
minimum loss value.

27. What is Convolutional Neural Network (CNN)?

"Convolutional neural networks" are neural networks that use a mathematical operation
called convolution in place of general matrix multiplication in at least one of their layers.
The architecture was popularized by Yann LeCun's LeNet-5 in 1998. One of its most popular
uses is image classification. A convolutional neural network can broadly be divided into
these parts:
1. Input layer
2. Convolutional layer
3. Output layers

28. Explain the architecture of Convolutional Neural Networks (CNN)?

The input layer is connected to convolutional layers, which involve operations such as
padding, striding, and the application of kernels. Because so much of the work happens
here, the convolutional layer is considered the building block of convolutional neural
networks. We will discuss its functioning in detail, along with how fully connected
networks work.
Convolutional Layer: The convolutional layer's main objective is to extract features from
images; the learned features help in tasks such as object detection. The input layer
contains pixel values with some width and height. Our kernels (filters) slide, or convolve,
across the input and produce feature maps that capture the features in fewer dimensions.
Let's see how kernels work:
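
How a kernel slides over the input can be sketched with two loops. The function name `conv2d`, the 5x5 test image, and the vertical-edge kernel are hypothetical choices; as in most CNN libraries, the operation shown is technically cross-correlation (the kernel is not flipped).

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """'Valid' 2D convolution: the kernel only visits positions where it fits."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)   # elementwise product, then sum
    return out

image = np.arange(25, dtype=float).reshape(5, 5)   # pixel value = 5 * row + column
kernel = np.array([[1.0, 0.0, -1.0]] * 3)          # simple vertical-edge detector
feature_map = conv2d(image, kernel)
print(feature_map.shape)   # (3, 3): smaller than the 5x5 input
print(feature_map)         # every entry is -6 because the image changes uniformly left to right
```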

Formation and arrangement of Convolutional Kernels


With the help of this very informative visualization about kernels, we can see how the kernels
work and how padding is done.


Matrix visualization in CNN


Need for Padding: We can see padding in our input volume. Padding is needed to make the
kernels fit the input matrices. Sometimes we use zero padding, i.e. adding rows and columns
of zeros on each side of the input matrix; alternatively we can drop the part of the input
image where the kernel does not fit, which is known as valid padding. To reduce the amount
of data with negligible loss of information, we use techniques like max pooling and average
pooling.
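
The effect of padding on the output size can be checked with the standard size formula, output = floor((n + 2p - k) / s) + 1. The helper name `conv_output_size` and the 5x5 input are assumptions for the example; `np.pad` is used for the zero padding.

```python
import numpy as np

def conv_output_size(n, kernel, padding=0, stride=1):
    """Output length along one dimension: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - kernel) // stride + 1

image = np.ones((5, 5))
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)  # zero padding

print(padded.shape)                           # (7, 7): one row/column of zeros on each side
print(conv_output_size(5, 3, padding=0))      # 3: valid padding shrinks the output
print(conv_output_size(5, 3, padding=1))      # 5: zero padding keeps the input size
```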
Max pooling or Average pooling:

Matrix formation using Max-pooling and average pooling


Max pooling or average pooling reduces the size of the feature maps, which speeds up the
computation of our convolutional architecture. Here, 2*2 filters and a stride of 2 are taken
(the usual choice). As the names suggest, max pooling extracts the maximum value from each
filter window and average pooling takes the average over each window. We perform pooling
to reduce dimensionality. Padding is added only if necessary. More convolutional layers can
be added to the model as needed.
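
Pooling with a 2*2 window and stride 2 can be sketched as follows. The function name `pool2d` and the 4x4 example matrix are hypothetical.

```python
import numpy as np

def pool2d(x, size=2, stride=2, mode="max"):
    """Max or average pooling over size x size windows."""
    out_h = (x.shape[0] - size) // stride + 1
    out_w = (x.shape[1] - size) // stride + 1
    reduce = np.max if mode == "max" else np.mean
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = reduce(window)
    return out

x = np.array([[1, 3, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]], dtype=float)

max_pooled = pool2d(x, mode="max")   # [[6, 8], [3, 4]]
avg_pooled = pool2d(x, mode="avg")   # [[3.75, 5.25], [2.0, 2.0]]
print(max_pooled)
print(avg_pooled)
```

The 4x4 input becomes 2x2 in both cases, which is the dimensionality reduction the text refers to.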

29. Explain activation functions in CNN?

An activation function is added to our network after a convolutional layer or at the end of
the network. What exactly does an activation function do? It helps decide which information
should be passed forward and which should not. Broadly, there are both linear and non-linear
activation functions, performing linear and non-linear transformations respectively, but
non-linear activation functions are far more useful and are therefore widely used in neural
networks and deep learning. The four best-known activation functions for adding
non-linearity to the network are described below.
1. Sigmoid Activation Function
The equation for the sigmoid function is
f(x) = 1 / (1 + e^{-x})

Sigmoid Activation function


The sigmoid activation function is widely used. It gives a probabilistic approach to
decision making because its output ranges between 0 and 1, so when we have to make a
decision or predict an output that can be read as a probability, we use this activation
function.


2. Hyperbolic Tangent Activation Function(Tanh)

Tanh Activation function


This activation function is often slightly better than the sigmoid function. Like the
sigmoid function, it is used to predict or to differentiate between two classes, but it maps
negative inputs to negative outputs and ranges between -1 and 1.
3. ReLU (Rectified Linear unit) Activation function
The rectified linear unit, or ReLU, is the most widely used activation function right now.
It ranges from 0 to infinity, and all negative values are converted to zero. Because every
negative input maps to zero, some neurons can stop responding to the data and no longer fit
it properly (the "dying ReLU" problem), but where there is a problem there is a solution.

Rectified Linear Unit activation function


We use the Leaky ReLU function instead of ReLU to avoid this problem. In Leaky ReLU,
negative inputs are mapped to small negative outputs instead of zero, which extends the
range and can improve performance.
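
ReLU and Leaky ReLU differ only in how they treat negative inputs, as this minimal sketch shows. The slope `alpha = 0.01` is a commonly used default and is an assumption here, as are the sample inputs.

```python
import numpy as np

def relu(x):
    """max(0, x): negative inputs become zero."""
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    """Like ReLU, but negative inputs keep a small slope alpha (assumed 0.01)."""
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
relu_out = relu(x)          # [0, 0, 0, 1.5]
leaky_out = leaky_relu(x)   # [-0.02, -0.005, 0, 1.5]
print(relu_out)
print(leaky_out)
```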


4. Softmax Activation Function


Softmax is used mainly at the last layer, i.e. the output layer, for decision making, in
the same way as sigmoid activation. Softmax assigns each class a value according to its
score, and these values sum to one, so they can be read as probabilities.

Softmax activation function


For binary classification, sigmoid and softmax are equally applicable, but for multi-class
classification problems we generally use softmax together with cross-entropy loss.
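
A small sketch makes both statements concrete: softmax outputs sum to one, and for two classes softmax reduces to sigmoid. The scores used are arbitrary example values, and subtracting the maximum before exponentiating is a standard numerical-stability step rather than something taken from the source.

```python
import numpy as np

def softmax(z):
    """Turn raw scores into values that are positive and sum to one."""
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

scores = np.array([2.0, 1.0, 0.1])      # example scores for 3 classes (assumed)
probs = softmax(scores)
print(probs, probs.sum())               # largest score gets the largest probability

# Two-class case: softmax over [z, 0] gives the same value as sigmoid(z).
z = 1.3
two_class = softmax(np.array([z, 0.0]))[0]
print(two_class, sigmoid(z))
```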

30. What is the need for activation functions?

"The world is one big data problem."

As it turns out, this saying holds true both for our brains and for machine learning.
Every single moment our brain is trying to segregate the incoming information into the
"useful" and "not-so-useful" categories.

A similar process occurs in artificial neural network architectures in deep learning. The
segregation plays a key role in helping a neural network function properly, ensuring that it
learns from the useful information rather than getting stuck analyzing the not-useful part.
This is also where activation functions come into the picture. An activation function helps
the neural network use important information while suppressing irrelevant data points.

31. What is a Neural Network Activation Function?

An Activation Function decides whether a neuron should be activated or not. This means
that it decides whether the neuron's input to the network is important or not in the
process of prediction, using simple mathematical operations. The role of the Activation
Function is to derive output from a set of input values fed to a node (or a layer). But let's
take a step back and clarify: what exactly is a node? If we compare the neural network
to our brain, a node is a replica of a neuron that receives a set of input signals (external
stimuli).

Depending on the nature and intensity of these input signals, the brain processes them and
decides whether the neuron should be activated ("fired") or not. In deep learning, this is
also the role of the Activation Function, which is why it is often referred to as a Transfer
Function in an Artificial Neural Network. The primary role of the Activation Function is to
transform the summed weighted input from the node into an output value to be fed to the
next hidden layer or used as output.


Now, let's have a look at the Neural Networks Architecture.


Elements of a Neural Networks Architecture
Here’s the thing— If you don’t understand the concept of neural networks and how they
work, diving deeper into the topic of activation functions might be challenging.
That’s why it’s a good idea to refresh your knowledge and take a quick look at the structure
of the Neural Networks Architecture and its components. Here it is.

In the image above, you can see a neural network made of interconnected neurons. Each of
them is characterized by its weight, bias, and activation function.
Input Layer
The input layer takes raw input from the domain. No computation is performed at this layer.
Nodes here just pass on the information (features) to the hidden layer.
Hidden Layer
As the name suggests, the nodes of this layer are not exposed. They provide an abstraction
to the neural network.
The hidden layer performs all kinds of computation on the features entered through the
input layer and transfers the result to the output layer.
Output Layer
It’s the final layer of the network that brings the information learned through the hidden
layer and delivers the final value as a result.
Note: All hidden layers usually use the same activation function. However, the output layer
will typically use a different activation function from the hidden layers. The choice depends
on the goal or type of prediction made by the model.


Feedforward vs. Backpropagation


When learning about neural networks, you will come across two essential terms describing
the movement of information—feedforward and backpropagation. Let’s explore them.
Feedforward Propagation - the flow of information occurs in the forward direction. The input
is used to calculate some intermediate function in the hidden layer, which is then used to
calculate an output. In feedforward propagation, the Activation Function is a mathematical
"gate" between the input feeding the current neuron and its output going to the next layer.

Backpropagation - the weights of the network connections are repeatedly adjusted to
minimize the difference between the actual output vector of the net and the desired output
vector. To put it simply, backpropagation aims to minimize the cost function by adjusting
the network's weights and biases. The gradients of the cost function with respect to these
parameters determine the size of each adjustment.
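
A forward pass followed by one backpropagation step can be sketched for a tiny network. Everything specific here is an assumption for illustration: the 2-3-1 layer sizes, the tanh hidden layer, the linear output, the squared-error loss, the learning rate, and the function names.

```python
import numpy as np

def forward(params, x):
    W1, b1, W2, b2 = params
    h = np.tanh(W1 @ x + b1)          # hidden layer
    y_hat = W2 @ h + b2               # linear output layer
    return h, y_hat

def loss(params, x, y):
    _, y_hat = forward(params, x)
    return 0.5 * np.sum((y_hat - y) ** 2)

def backward(params, x, y):
    """Gradients of the loss with respect to every weight and bias (chain rule)."""
    W1, b1, W2, b2 = params
    h, y_hat = forward(params, x)
    d_out = y_hat - y                 # dL/dy_hat
    dW2 = np.outer(d_out, h)
    db2 = d_out
    d_h = W2.T @ d_out                # error propagated back to the hidden layer
    d_z = d_h * (1.0 - h ** 2)        # tanh'(z) = 1 - tanh(z)^2
    dW1 = np.outer(d_z, x)
    db1 = d_z
    return [dW1, db1, dW2, db2]

rng = np.random.default_rng(0)
params = [rng.normal(size=(3, 2)), np.zeros(3), rng.normal(size=(1, 3)), np.zeros(1)]
x, y = np.array([0.5, -1.0]), np.array([1.0])

grads = backward(params, x, y)
lr = 0.1                              # illustrative learning rate (assumed)
updated = [p - lr * g for p, g in zip(params, grads)]
print(loss(params, x, y), loss(updated, x, y))   # loss before and after one update
```

Comparing a backpropagated gradient with a finite-difference estimate of the same derivative is a common way to check that the chain-rule code is right.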
Why do Neural Networks Need an Activation Function?
So we know what Activation Function is and what it does, but— Why do Neural Networks
need it? Well, the purpose of an activation function is to add non-linearity to the neural
network.

Activation functions introduce an additional step at each layer during forward propagation,
but the extra computation is worth it. Here is why: suppose we have a neural network
working without activation functions. In that case, every neuron only performs a linear
transformation on its inputs using the weights and biases. It then does not matter how many
hidden layers we attach to the neural network; all layers behave in the same way, because
the composition of two linear functions is itself a linear function. Although the neural
network becomes simpler, learning any complex task is impossible, and our model would be
just a linear regression model.
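
The claim that stacked linear layers collapse into one can be verified numerically. In this sketch the layer sizes and random weights are arbitrary assumptions; the point is that two layers without activation equal a single layer with weights W2·W1 and bias W2·b1 + b2.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)   # layer 1, no activation
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)   # layer 2, no activation
x = rng.normal(size=3)

two_layer = W2 @ (W1 @ x + b1) + b2    # output of the 2-layer linear network

W, b = W2 @ W1, W2 @ b1 + b2           # equivalent single layer
one_layer = W @ x + b

collapsed = np.allclose(two_layer, one_layer)
print(collapsed)  # True: the extra layer added no expressive power
```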
3 Types of Neural Networks Activation Functions
Now, as we’ve covered the essential concepts, let’s go over the most popular neural
networks activation functions.
Binary Step Function: Binary step function depends on a threshold value that decides
whether a neuron should be activated or not. The input fed to the activation function is
compared to a certain threshold; if the input is greater than it, then the neuron is activated,
else it is deactivated, meaning that its output is not passed on to the next hidden layer.

Binary Step Function


Mathematically it can be represented as (taking the threshold to be 0):
f(x) = 0 for x < 0, and f(x) = 1 for x >= 0

Here are some of the limitations of binary step function:
• It cannot provide multi-value outputs; for example, it cannot be used for multi-class
classification problems.
• The gradient of the step function is zero wherever it is defined, which hinders the
backpropagation process.
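
The binary step function and its zero gradient can be sketched as follows. The threshold of 0, the `>=` convention at the threshold, and the function name are assumptions for the example.

```python
import numpy as np

def binary_step(x, threshold=0.0):
    """1 if the input reaches the threshold, else 0 (threshold 0 is assumed)."""
    return np.where(x >= threshold, 1.0, 0.0)

outputs = binary_step(np.array([-1.0, 0.0, 2.0]))
print(outputs)  # [0. 1. 1.]

# Numerical slope away from the threshold: the output does not change, so the slope is 0.
h = 1e-3
slope = float(binary_step(1.0 + h) - binary_step(1.0 - h)) / (2 * h)
print(slope)    # 0.0, so gradient-based learning gets no signal
```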
Linear Activation Function: The linear activation function, also known as "no activation" or
the "identity function" (the input multiplied by 1.0), is where the activation is
proportional to the input. The function doesn't do anything to the weighted sum of the
input; it simply returns the value it was given.

Linear Activation Function

Mathematically it can be represented as:
f(x) = x

However, a linear activation function has two major problems:
• Backpropagation gains nothing from it, as the derivative of the function is a
constant and has no relation to the input x.
• All layers of the neural network will collapse into one if a linear activation function is
used. No matter the number of layers in the neural network, the last layer will still be
a linear function of the first layer. So, essentially, a linear activation function turns the
neural network into just one layer.
Non-Linear Activation Functions
A network that uses only the linear activation function shown above is simply a linear
regression model. Because of its limited power, this does not allow the model to create
complex mappings between the network's inputs and outputs.
Non-linear activation functions solve the following limitations of linear activation functions:
• They allow backpropagation because now the derivative function would be related to
the input, and it's possible to go back and understand which weights in the input
neurons can provide a better prediction.
• They allow the stacking of multiple layers of neurons as the output would now be a
non-linear combination of input passed through multiple layers. Any output can be
represented as a functional computation in a neural network.
Now, let’s have a look at ten different non-linear neural networks activation functions and
their characteristics.


32. Explain non-linear Neural Networks activation functions.

Sigmoid / Logistic Activation Function


This function takes any real value as input and outputs values in the range of 0 to 1.
The larger the input (more positive), the closer the output value will be to 1.0, whereas the
smaller the input (more negative), the closer the output will be to 0.0, as shown below.

Sigmoid/Logistic Activation Function


Mathematically it can be represented as:
f(x) = 1 / (1 + e^{-x})

Here's why sigmoid/logistic activation function is one of the most widely used functions:
• It is commonly used for models where we have to predict the probability as an
output. Since a probability lies only in the range of 0 to 1, sigmoid is the right choice
because of its range.
• The function is differentiable and provides a smooth gradient, i.e., preventing jumps
in output values. This is represented by the S-shape of the sigmoid activation
function.
The limitations of sigmoid function are discussed below:
• The derivative of the function is f'(x) = sigmoid(x)*(1-sigmoid(x)).


The derivative of the Sigmoid Activation Function


As we can see from the above Figure, the gradient values are only significant in the range
-3 to 3, and the graph gets much flatter in other regions. This implies that for values
greater than 3 or less than -3, the function has very small gradients. As the gradient value
approaches zero, the network ceases to learn and suffers from the vanishing gradient
problem. The output of the logistic function is also not symmetric around zero: it is always
positive, so the outputs of all the neurons have the same sign. This makes the training of
the neural network more difficult and unstable.
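
The size of the sigmoid gradient inside and outside the range -3 to 3 can be checked directly from f'(x) = sigmoid(x)*(1-sigmoid(x)). The sample points and the 10-layer illustration are assumptions chosen for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (0.0, 3.0, 10.0):
    print(x, sigmoid_grad(x))   # 0.25 at x = 0, about 0.045 at x = 3, about 4.5e-05 at x = 10

# The gradient is never larger than 0.25, so a chain of 10 sigmoid layers
# scales the backpropagated signal by at most 0.25 ** 10.
upper_bound = 0.25 ** 10
print(upper_bound)              # about 9.5e-07
```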
Tanh Function (Hyperbolic Tangent)
Tanh function is very similar to the sigmoid/logistic activation function, and even has the
same S-shape with the difference in output range of -1 to 1. In Tanh, the larger the input
(more positive), the closer the output value will be to 1.0, whereas the smaller the input
(more negative), the closer the output will be to -1.0.

Tanh Function (Hyperbolic Tangent)


Mathematically it can be represented as: $f(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$

Advantages of using this activation function are:
 The output of the tanh activation function is zero-centered; hence we can easily map
the output values as strongly negative, neutral, or strongly positive.
 Usually used in hidden layers of a neural network because its values lie between -1 and 1;
therefore, the mean of the hidden-layer outputs comes out to be 0 or very close to it. It helps
in centering the data and makes learning for the next layer much easier.
Have a look at the gradient of the tanh activation function to understand its limitations.


Gradient of the Tanh Activation Function


As you can see, it also faces the problem of vanishing gradients, similar to the sigmoid
activation function. In addition, the gradient of the tanh function is much steeper than that
of the sigmoid function. Note: Although both sigmoid and tanh face the vanishing gradient
issue, tanh is zero-centered, so the gradients are not restricted to move in a certain direction.
Therefore, in practice, the tanh nonlinearity is usually preferred to the sigmoid nonlinearity.
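
To make the comparison between the two gradients concrete, here is a small sketch, assuming the standard identity that the derivative of tanh(x) is 1 - tanh(x)^2. The helper names are hypothetical and chosen only for this illustration.

```python
import math

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

def sigmoid_grad(x):
    """Derivative of the logistic function: s * (1 - s)."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

# Near zero tanh has the steeper gradient; both gradients vanish for large |x|.
for x in (0.0, 1.0, 3.0, 5.0):
    print(f"x={x:.1f}  tanh'={tanh_grad(x):.5f}  sigmoid'={sigmoid_grad(x):.5f}")
```

At x = 0 the tanh gradient is 1 while the sigmoid gradient is 0.25, which is what "steeper" refers to here.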
ReLU Function
ReLU stands for Rectified Linear Unit. Although it gives the impression of a linear function,
ReLU is nonlinear; it has a derivative (everywhere except at 0) and so allows for
backpropagation while remaining computationally efficient. The main catch here is that the
ReLU function does not activate all the neurons at the same time. A neuron is deactivated
only if the output of the linear transformation is less than 0.

ReLU Activation Function


Mathematically it can be represented as: $f(x) = \max(0, x)$

The advantages of using ReLU as an activation function are as follows:
 Since only a certain number of neurons are activated, the ReLU function is far more
computationally efficient when compared to the sigmoid and tanh functions.
 ReLU accelerates the convergence of gradient descent towards a minimum
of the loss function due to its linear, non-saturating property.
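
A minimal sketch of ReLU and its gradient follows. It assumes the common convention of taking the derivative at exactly 0 to be 0; that choice is an assumption of this example and is not stated in the notes.

```python
def relu(x):
    """Rectified Linear Unit: passes positive inputs through, clips the rest to 0."""
    return max(0.0, x)

def relu_grad(x):
    """Gradient of ReLU: 1 for positive inputs, 0 otherwise (convention at x = 0)."""
    return 1.0 if x > 0 else 0.0

# Negative pre-activations deactivate the neuron; positive ones keep a gradient of 1.
for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  relu={relu(x):.1f}  gradient={relu_grad(x):.1f}")
```

Because the gradient stays at 1 for every positive input, it does not saturate the way the sigmoid and tanh gradients do.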
Softmax Function
Before exploring the ins and outs of the Softmax activation function, we should focus on its
building block—the sigmoid/logistic activation function that works on calculating probability
values.

Probability
The output of the sigmoid function is in the range of 0 to 1, which can be thought of as a
probability. But this function faces a problem in the multi-class setting. Suppose we have
five output values of 0.8, 0.9, 0.7, 0.8, and 0.6, respectively. How can we move forward with
them? The answer is: we can’t. These values don’t make sense as class probabilities, because
the probabilities of all the classes should sum to 1. The Softmax function is often described
as a combination of multiple sigmoids. It calculates relative probabilities: similar to the
sigmoid/logistic activation function, the Softmax function returns the probability of each
class, but normalized so that the probabilities sum to 1. It is most commonly used as the
activation function for the last layer of a neural network in the case of multi-class
classification. Mathematically it can be represented as:
$\mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$, where $z_i$ is the output of
the $i$-th neuron and $K$ is the number of classes.

Softmax Function
Let’s go over a simple example together.
Assume that you have three classes, meaning that there would be three neurons in the
output layer. Now, suppose that the output from the neurons is [1.8, 0.9, 0.68]. Applying the
softmax function over these values to give a probabilistic view results in the following
outcome: [0.58, 0.23, 0.19]. Taking the index with the largest probability (an argmax step
applied after softmax) then gives full weight to index 0 and no weight to index 1 and index 2.
So the output would be the class corresponding to the 1st neuron (index 0) out of three. You
can see now how the softmax activation function makes things easy for multi-class
classification problems.
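
The worked example above can be reproduced with a few lines of code. This is a sketch assuming plain Python; subtracting the maximum logit before exponentiating is a standard numerical-stability step added for this illustration and does not change the result.

```python
import math

def softmax(logits):
    """Turn a list of raw scores into probabilities that sum to 1."""
    m = max(logits)                            # shift for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [1.8, 0.9, 0.68]                      # outputs of the three neurons
probs = softmax(logits)
predicted = probs.index(max(probs))            # argmax: index of the winning class

print([round(p, 2) for p in probs])            # [0.58, 0.23, 0.19]
print(predicted)                               # 0
```

Softmax itself returns the three probabilities; picking the class is a separate argmax step applied to them.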
Scaled Exponential Linear Unit (SELU)
SELU was defined in self-normalizing networks and takes care of internal normalization
which means each layer preserves the mean and variance from the previous layers. SELU
enables this normalization by adjusting the mean and variance. SELU has both positive and
negative values to shift the mean, which was impossible for ReLU activation function as it
cannot output negative values. Gradients can be used to adjust the variance. The activation
function needs a region with a gradient larger than one to increase it.

SELU Activation Function


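Mathematically it can be represented as: $f(x) = \lambda x$ for $x > 0$ and
$f(x) = \lambda \alpha (e^{x} - 1)$ for $x \le 0$.

SELU has values of alpha α and lambda λ predefined (α ≈ 1.6733, λ ≈ 1.0507).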


Here’s the main advantage of SELU over ReLU: internal normalization is faster than external
normalization, which means the network converges faster. SELU is a relatively new
activation function, and more studies are needed on architectures such as CNNs and RNNs,
where it has been comparatively little explored.
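
A short sketch of SELU follows, assuming the predefined constants published for self-normalizing networks (α ≈ 1.6733, λ ≈ 1.0507). The constant and function names are chosen for this example only.

```python
import math

ALPHA = 1.6732632423543772     # predefined alpha for SELU
LAMBDA = 1.0507009873554805    # predefined lambda (scale) for SELU

def selu(x):
    """Scaled Exponential Linear Unit."""
    if x > 0:
        return LAMBDA * x
    return LAMBDA * ALPHA * (math.exp(x) - 1.0)

# Unlike ReLU, SELU produces negative outputs, which lets it shift the mean.
for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={x:+.1f}  selu={selu(x):+.4f}")
```

For positive inputs the slope is λ, which is slightly larger than 1; for very negative inputs the output levels off near -λα.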

33. Why are deep neural networks hard to train?

There are two challenges you might encounter when training your deep neural networks.
Let's discuss them in more detail.
Vanishing Gradients: Certain activation functions, like the sigmoid function, squish a
large input space into a small output space between 0 and 1. Therefore, a large change in
the input of the sigmoid function will cause only a small change in the output. Hence, the
derivative becomes small. For shallow networks with only a few layers that use these
activations, this isn’t a big problem. However, when more layers are used, the small
derivatives are multiplied together during backpropagation, which can cause the gradient to
be too small for training to work effectively.
Exploding Gradients: Exploding gradients occur when large error gradients accumulate
and result in very large updates to the neural network's weights during training. Exploding
gradients can make the network unstable, so that learning cannot be completed. The values
of the weights can also become so large that they overflow and result in NaN (not a number)
values.
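
Both problems come from multiplying many per-layer factors together during backpropagation. The sketch below is a deliberately simplified model: it assumes every layer scales the gradient by the same constant factor and ignores the actual weights, so it only illustrates the trend. The function name and the chosen factors are hypothetical.

```python
def gradient_scale(per_layer_factor, n_layers):
    """Overall scaling of a gradient after passing back through n_layers layers,
    assuming (for illustration only) the same factor at every layer."""
    return per_layer_factor ** n_layers

# 0.25 is the largest possible sigmoid derivative: the product shrinks towards 0.
for n in (2, 5, 10, 20):
    print(f"{n:2d} layers, factor 0.25 -> {gradient_scale(0.25, n):.3e}")

# A factor above 1 at each layer makes the product blow up instead.
for n in (10, 50, 100):
    print(f"{n:3d} layers, factor 1.50 -> {gradient_scale(1.5, n):.3e}")
```

With a factor below 1 the gradient reaching the early layers becomes negligible; with a factor above 1 it grows without bound, which is the exploding case.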

34. How to choose the right Activation Function?

You need to match your activation function for your output layer based on the type of
prediction problem that you are solving—specifically, the type of predicted variable.
Here’s what you should keep in mind.
As a rule of thumb, you can begin with using the ReLU activation function and then move
over to other activation functions if ReLU doesn’t provide optimum results.
And here are a few other guidelines to help you out.
1. ReLU activation function should only be used in the hidden layers.
2. Sigmoid/Logistic and Tanh functions should not be used in hidden layers as they
make the model more susceptible to problems during training (due to vanishing
gradients).
3. Swish function is used in neural networks having a depth greater than 40 layers.
Finally, a few rules for choosing the activation function for your output layer based on the
type of prediction problem that you are solving:
1. Regression - Linear Activation Function
2. Binary Classification - Sigmoid/Logistic Activation Function
3. Multiclass Classification - Softmax
4. Multilabel Classification - Sigmoid
The activation function used in hidden layers is typically chosen based on the type of neural
network architecture.
1. Convolutional Neural Network (CNN): ReLU activation function.
2. Recurrent Neural Network: Tanh and/or Sigmoid activation function.
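
The output-layer rules listed above amount to a simple lookup. As a sketch only, with hypothetical key names and a hypothetical helper `choose_output_activation` that belong to no library:

```python
# Output-layer activation by problem type, following the rules of thumb above.
OUTPUT_ACTIVATION = {
    "regression": "linear",
    "binary_classification": "sigmoid",
    "multiclass_classification": "softmax",
    "multilabel_classification": "sigmoid",
}

def choose_output_activation(problem_type):
    """Return the suggested output activation for a given problem type."""
    try:
        return OUTPUT_ACTIVATION[problem_type]
    except KeyError:
        raise ValueError(f"unknown problem type: {problem_type!r}") from None

print(choose_output_activation("multiclass_classification"))
```

These are starting points rather than fixed rules; the hidden-layer choice is made separately, as described above.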
Summary:
 Activation Functions are used to introduce non-linearity in the network.
 A neural network will almost always have the same activation function in all hidden
layers. This activation function should be differentiable so that the parameters of the
network are learned in backpropagation.
 ReLU is the most commonly used activation function for hidden layers.


 While selecting an activation function, you must consider the problems it might face:
vanishing and exploding gradients.
 Regarding the output layer, we must always consider the expected value range of the
predictions. If it can be any numeric value (as in case of the regression problem) you
can use the linear activation function or ReLU.
 Use Softmax or Sigmoid function for the classification problems.

35. How does deep learning attain such impressive results?

In a word, accuracy. Deep learning achieves recognition accuracy at higher levels than ever
before. This helps consumer electronics meet user expectations, and it is crucial for safety-
critical applications like driverless cars. Recent advances in deep learning have improved to
the point where deep learning outperforms humans in some tasks like classifying objects in
images. While deep learning was first theorized in the 1980s, there are two main reasons it
has only recently become useful:


1. Deep learning requires large amounts of labeled data. For example, driverless car
development requires millions of images and thousands of hours of video.
2. Deep learning requires substantial computing power. High-performance GPUs have
a parallel architecture that is efficient for deep learning. When combined with
clusters or cloud computing, this enables development teams to reduce training time
for a deep learning network from weeks to hours or less.

36. State examples of Deep Learning.

Deep learning applications are used in industries from automated driving to medical devices.
Automated Driving: Automotive researchers are using deep learning to automatically
detect objects such as stop signs and traffic lights. In addition, deep learning is used to
detect pedestrians, which helps decrease accidents.
Aerospace and Defense: Deep learning is used to identify objects from satellites that locate
areas of interest, and identify safe or unsafe zones for troops.
Medical Research: Cancer researchers are using deep learning to automatically detect
cancer cells. Teams at UCLA built an advanced microscope that yields a high-dimensional
data set used to train a deep learning application to accurately identify cancer cells.
Industrial Automation: Deep learning is helping to improve worker safety around heavy
machinery by automatically detecting when people or objects are within an unsafe distance
of machines.
Electronics: Deep learning is being used in automated hearing and speech translation. For
example, home assistance devices that respond to your voice and know your preferences are
powered by deep learning applications.

37. What's the Difference Between Machine Learning and Deep Learning?

Deep learning is a specialized form of machine learning. A machine learning workflow starts
with relevant features being manually extracted from images. The features are then used to
create a model that categorizes the objects in the image. With a deep learning workflow,
relevant features are automatically extracted from images. In addition, deep learning
performs "end-to-end learning" – where a network is given raw data and a task to perform,
such as classification, and it learns how to do this automatically. Another key difference is
deep learning algorithms scale with data, whereas shallow learning converges. Shallow
learning refers to machine learning methods that plateau at a certain level of performance
when you add more examples and training data to the network. A key advantage of deep
learning networks is that they often continue to improve as the size of your data increases. In
machine learning, you manually choose features and a classifier to sort images. With deep
learning, feature extraction and modeling steps are automatic.


https://in.mathworks.com/videos/ai-for-engineers-building-an-ai-system-1603356830725.html
https://in.mathworks.com/videos/object-recognition-deep-learning-and-machine-learning-for-computer-vision-121144.html
https://in.mathworks.com/videos/introduction-to-deep-learning-what-are-convolutional-neural-networks--1489512765771.html

Figure: Comparing a machine learning approach to categorizing vehicles (left) with deep
learning (right).

38. Choosing Between Machine Learning and Deep Learning

 Machine learning offers a variety of techniques and models you can choose based on
your application, the size of data you're processing, and the type of problem you
want to solve.
 A successful deep learning application requires a very large amount of data
(thousands of images) to train the model, as well as GPUs, or graphics processing
units, to rapidly process your data.
 When choosing between machine learning and deep learning, consider whether you
have a high-performance GPU and lots of labeled data.
 If you don’t have either of those things, it may make more sense to use machine
learning instead of deep learning.
 Deep learning is generally more complex, so you’ll need at least a few thousand
images to get reliable results.


 Having a high-performance GPU means the model will take less time to analyze all
those images.

39. Application oriented questions

Explain the role of reinforcement learning in the following example. Identify the environment,
agent, different actions, reward, punishment, etc. Draw its block diagram.

Reinforcement learning provides the means for robots to learn complex behavior from
interaction on the basis of generalizable behavioural primitives. From negative human
feedback, the robot learns from its own misconduct.


[Figure: block diagram with the labels Agent, Reward, Punishment, Different actions, and Environment]

Feel free to contact me on +91-8329347107 (calling) / +91-9922369797 (WhatsApp),
email ID: [email protected] and [email protected]

*********************



Artificial Intelligence & Machine Learning
Course Code: 302049

Unit 6: Applications
Third Year Bachelor of Engineering (Choice Based Credit System)
Mechanical Engineering (2019 Course)
Board of Studies – Mechanical and Automobile Engineering, SPPU, Pune
(With Effect from Academic Year 2021-22)

Question bank and its solution


by

Abhishek D. Patange, Ph.D.


Department of Mechanical Engineering
College of Engineering Pune (COEP)
QUESTION BANK FOR UNIT 6: APPLICATIONS

Unit 6: Applications
Syllabus:
Content Theory
Human Machine Interaction
Predictive Maintenance and Health Management
Fault Detection
Dynamic System Order Reduction
Image based part classification
Process Optimization
Material Inspection
Tuning of control algorithms

Type of question and marks:


Type Theory
Marks 2 or 4 or 6


Topic: Human Machine Interaction

1. What is human-machine interaction?

 HMI is all about how people and automated systems interact and communicate with
each other. That has long ceased to be confined to just traditional machines in industry
and now also relates to computers, digital systems or devices for the Internet of Things
(IoT).
 More and more devices are connected and automatically carry out tasks. Operating all of
these machines, systems and devices needs to be intuitive and must not place excessive
demands on users.
 Human-machine interaction is all about how people and automated systems interact
with each other.
 HMI now plays a major role in industry and everyday life: More and more devices are
connected and automatically carry out tasks.
 A user interface that is as intuitive as possible is therefore needed to enable smooth
operation of these machines. That can take very different forms.

2. How does human-machine interaction work?

 Smooth communication between people and machines requires interfaces: The place
where or action by which a user engages with the machine.
 Simple examples are light switches or the pedals and steering wheel in a car: An action is
triggered when you flick a switch, turn the steering wheel or step on a pedal.
 However, a system can also be controlled by text being keyed in, a mouse, touch screens,
voice or gestures.
 The devices are either controlled directly: Users touch the smartphone’s screen or issue a
verbal command. Or the systems automatically identify what people want: Traffic lights
change color on their own when a vehicle drives over the inductive loop in the road’s
surface.
 Other technologies are not so much there to control devices, but rather to complement
our sensory organs. One example of that is virtual reality glasses.
 There are also digital assistants: Chatbots, for instance, reply automatically to requests
from customers and keep on learning.
 User interfaces in HMI are the places where or actions by which the user engages with
the machine.
 A system can be operated by means of buttons, a mouse, touch screens, voice or
gesture, for instance.


 One simple example is a light switch – the interface between the machine "light" and a
human being.
 It is also possible to differentiate further between direct control, such as tapping a touch
screen, and automatic control.
 In the latter case, the system itself identifies what people want.
 Think of traffic lights which change color as soon as a vehicle drives over the inductive
loop in the road’s surface.

3. What human-machine systems are there?

 For a long time, machines were mainly controlled by switches, levers, steering wheels or
buttons; these were joined later by the keyboard and mouse.
 Now we are in the age of the touch screen. Body sensors in wearables that automatically
collect data are also modern interfaces.
 Voice control is also making rapid advances: Users can already control digital assistants,
such as Amazon Alexa or Google Assistant, by voice.
 That entails far less effort. Chatbots are also used in such systems and their ability to
communicate with people is improving more and more thanks to artificial intelligence.

4. What trends are there in human-machine interaction?

 Gesture control is at least as intuitive as voice control. That means robovacs, for example,
could be stopped by a simple hand signal in the future.
 Google and Infineon have already developed a new type of gesture control by the name
of "Soli":
 Devices can also be operated in the dark or remotely with the aid of radar technology.
 Technologies that augment reality now also act as an interface. Virtual reality glasses
immerse users in an artificially created 3D world, while augmented reality glasses
superimpose virtual elements in the real environment.
 Mixed reality glasses combine both technologies, thus enabling scenarios to be
presented realistically thanks to their high resolution.

5. What opportunities and challenges arise from human-machine interaction?

 Modern HMI helps people to use even very complex systems with ease. Machines also
keep on getting better at interpreting signals – and that is important in particular in
autonomous driving.
 Human needs are identified even more accurately, which means robots can be used in
caring for people, for instance. One potential risk is the fact that hackers might obtain
information on users via the machines’ sensors.


 Last but not least, security is vital in human-machine interaction. Some critics also fear
that self-learning machines may become a risk by taking actions autonomously.
 It is also necessary to clarify the question of who is liable for accidents caused by HMI.

6. Where is human-machine interaction headed?

 Whether voice and gesture control or virtual, augmented and mixed reality, HMI
interaction is far from reaching the end of the line.
 In future, data from different sensors will also increasingly be combined to capture and
control complex processes optimally.
 The human senses will be replicated better and better with the aid of, for example, gas
sensors, 3D cameras and pressure sensors, thus expanding the devices’ capabilities.
 In contrast, there will be fewer of the input devices that are customary at present, such as
remote controllers.

7. What are the opportunities and challenges of HMI?

 Even complex systems will become easier to use thanks to modern human-machine
interaction. To enable that, machines will adapt more and more to human habits and
needs. Virtual reality, augmented reality and mixed reality will also allow them to be
controlled remotely. As a result, humans expand their realm of experience and field of
action.
 Machines will also keep on getting better at interpreting signals in future – and that’s
also necessary: The fully autonomous car must respond correctly to hand signals from a
police officer at an intersection. Robots used in care must likewise be able to "assess" the
needs of people who are unable to express these themselves.
 The more complex the contribution made by machines is, the more important it is to
have efficient communication between them and users. Does the technology also
understand the command as it was meant? If not, there’s the risk of misunderstandings –
and the system won’t work as it should. The upshot: A machine produces parts that don’t
fit, for example, or the connected car strays off the road.
 People, with their abilities and limitations, must always be taken into account in the
development of interfaces and sensors. Operating a machine must not be overly complex
or require too much familiarization. Smooth communication between human and
machine also needs the shortest possible response time between command and action,
otherwise users won’t perceive the interaction as being natural.
 One potential risk arises from the fact that machines are highly dependent on sensors to
be controlled or respond automatically. If hackers have access to the data, they obtain
details of the user’s actions and interests. Some critics also fear that even learning
machines might act autonomously and subjugate people. One question that has also not
been clarified so far is who is liable for accidents caused by errors in human-machine
interaction, and who is responsible for them.
Reference: https://www.infineon.com/cms/en/discoveries/human-machine-interaction/

Topic: Fault Detection / Predictive Maintenance / Health Management

8. Make a list of maintenance and explain in brief. Discuss the scope of AIML.

Predictive maintenance: Predictive maintenance is used to:
 Identify anomalies in the process, which helps in preventive maintenance.
 Estimate the demand for products, raw materials, etc., based on historical data and
the current scenario.
 Forecast possible outcomes based on data obtained from the process.
Prescriptive maintenance: Prescriptive maintenance is used to identify ways in which an
industrial process can be improved. While predictive maintenance tells you when a
component/asset could fail, prescriptive maintenance tells you what action you need to take
to avoid the failure. So, you can use the results obtained from prescriptive analysis to plan
the maintenance schedule, review your supplier, etc. Prescriptive maintenance also helps you
manage complex problems in the production process using relevant information.
Descriptive maintenance: The core purpose of descriptive maintenance is to describe the
problem by diagnosing the symptoms. This method also helps discover the trends and
patterns based on historical data. The results of a descriptive maintenance are usually shown
in the form of charts and graphs. These data visualization tools make it easy for all the
stakeholders, even those who are non-technical to understand the problems in the
manufacturing process.
Diagnostic maintenance: Diagnostic maintenance is also referred to as root cause analysis.
While descriptive maintenance can tell what happened based on historical data, diagnostic
maintenance tells you why it happened. Data mining, data discovery, correlation, and
drill-down and drill-through methods are used in diagnostic maintenance. Diagnostic
maintenance can be used to identify the cause of an equipment malfunction or the reason for
a drop in product quality.

9. Explain fault diagnosis (of any suitable machine element) using ML.

Refer to the following articles and explain the procedure they have adopted.
 Sakthivel, N. R., Sugumaran, V., & Babudevasenapati, S. (2010). Vibration based fault
diagnosis of monoblock centrifugal pump using decision tree. Expert Systems with
Applications, 37(6), 4040-4049.
https://www.sciencedirect.com/science/article/pii/S0957417409008689
 Sugumaran, V., Muralidharan, V., & Ramachandran, K. I. (2007). Feature selection
using decision tree and classification through proximal support vector machine for
fault diagnostics of roller bearing. Mechanical Systems and Signal Processing, 21(2),
930-942.
https://www.sciencedirect.com/science/article/pii/S0888327006001142
 Patange, A. D., & Jegadeeshwaran, R. (2021). A machine learning approach for
vibration-based multipoint tool insert health prediction on vertical machining centre
(VMC). Measurement, 173, 108649.
https://www.sciencedirect.com/science/article/pii/S0263224120311659
 Patange, A. D., & Jegadeeshwaran, R. (2020). Application of Bayesian family classifiers
for cutting tool inserts health monitoring on CNC milling. International Journal of
Prognostics and Health Management, 11(2).
http://papers.phmsociety.org/index.php/ijphm/article/view/2929

Topic: Image based part classification

10. Explain an intelligent approach for classification of Nuts, Bolts, Washers and Locating Pins.

An intelligent approach to classifying Nuts, Bolts, Washers and Locating Pins, in the same
way that image classifiers are commonly taught on cats and dogs, is explained here.

Bolt or Nut or Locating Pin or Washer? Will the AI be able to tell?


So how does it work? An algorithm is able to classify images efficiently by using a Machine
Learning algorithm called a Convolutional Neural Network (CNN), a method used in Deep
Learning. We will be using a simple version of this model, called Sequential (a linear stack of
layers), to let our model sort the images into four classes: Nuts, Bolts, Washers and Locating
Pins. The model will learn by "observing" a set of training images. After learning, we will see
how accurately it can predict the class of an image it has not seen.


A flowchart of a Machine Learning algorithm trained on Images of Nuts and Bolts using a
Neural Network Model.
Data-set
We downloaded 238 parts for each of the 4 classes (238 x 4 = 952 parts in total) from various
part libraries available on the internet. Then we took 8 different isometric images of each
part. This was done to augment the data available, as only 238 images for each class would
not be enough to train a good neural network. A single class now has 1904 images (8
isometric images of each of 238 parts), for a total of 7616 images. Each image is 224 x 224
pixels.

Images of the 4 classes. 1 part has 8 images. Each image is treated as a single data point.

We then have our labels with the numbers 0, 1, 2, 3; each number corresponds to a particular
image and means it belongs to a certain class.
#Integers and their corresponding classes
{0: 'locatingpin', 1: 'washer', 2: 'bolt', 3: 'nut'}
After training on the above images we will then see how well our model predicts a random
image it has not seen.


Methodology
The process took place in 7 steps. We will get to the details later. A brief summary:
1. Data Collection: The data for each class was collected from various standard part
libraries on the internet.
2. Data Preparation: 8 isometric-view screenshots were taken of each part and
reduced to 224 x 224 pixels.
3. Model Selection: A Sequential CNN model was selected as it was simple and good
for image classification.
4. Train the Model: The model was trained on our data of 7616 images with an 80/20
train-test split.
5. Evaluate the Model: The results of the model were evaluated to see how well it
predicted the classes.
6. Hyperparameter Tuning: This process is done to tune the hyperparameters to get
better results. We have already tuned our model in this case.
7. Make Predictions: Check how well it predicts real-world data.
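
The dataset size and the 80/20 split in step 4 can be sketched as follows. This is an illustration only: the notes do not say how the split was performed, so the random shuffle, the seed, and the helper name `train_test_split_indices` are assumptions of this example.

```python
import random

def train_test_split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle the indices 0..n_samples-1 and cut them into train and test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # fixed seed for a repeatable split
    n_test = int(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]

PARTS_PER_CLASS = 238
VIEWS_PER_PART = 8
N_CLASSES = 4
n_images = PARTS_PER_CLASS * VIEWS_PER_PART * N_CLASSES   # 238 x 8 x 4 = 7616

train_idx, test_idx = train_test_split_indices(n_images)
print(n_images, len(train_idx), len(test_idx))
```

One design point worth noting: because 8 images come from the same part, a split made image by image, as here, can place views of one part in both the training and the test set.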
Data Collection
We downloaded the part data of various nuts and bolts from the different part libraries on
the internet. These websites have numerous 3D models for standard parts from various
makers in different file formats. Since we will be using FreeCAD API to extract the images we
downloaded the files in neutral format (STEP).

Flowchart of the CAD model download


As already mentioned earlier, 238 parts from each of the 4 classes were downloaded, for a
total of 952 parts.


Data Preparation
Then we ran a program using the FreeCAD API that automatically took 8 isometric
screenshots of 224 x 224 pixels of each part. FreeCAD is a free and open-source
general-purpose parametric 3D computer-aided design modeler that can be scripted in Python.

A flowchart of how the data was created


As already mentioned above, each part yields 8 images of 224 x 224 pixels. So we now
have a total of 1904 images from each of the 4 classes, and thus a total of 7616 images. Each
image is treated as a separate data point even though 8 images come from the same part.

8 isometric images of 2 bolts. Each row represents a different part.


The images were kept in separate folders according to their class, i.e. we have four folders:
Nut, Bolt, Washer and Locating Pin.
Next, each of these images was converted into an array of its pixel values in grayscale.
The pixel values range from 0 (black) to 255 (white), so there are actually 256 shades of gray.


Example of an Image converted to an array of pixel values. (Source: openframeworks.cc)


Now each of our images becomes a 224 x 224 array. So our entire dataset is a 3D array of
7616 x 224 x 224 dimensions.
7616 (No. of images) x 224 x 224 (pixel values of each image)
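
To get a feel for this 3D layout, the sketch below builds a tiny nested-list stand-in with the same (images, rows, columns) structure and counts the values in the full dataset. The helper names are hypothetical, and the size estimate assumes one byte per grayscale pixel, which the notes do not state.

```python
N_IMAGES, HEIGHT, WIDTH = 7616, 224, 224

def make_dataset(n_images, height, width, fill=0):
    """Build a nested list shaped (n_images, height, width) filled with one value."""
    return [[[fill] * width for _ in range(height)] for _ in range(n_images)]

def shape(dataset):
    """Return (images, rows, columns) of a nested-list dataset."""
    return (len(dataset), len(dataset[0]), len(dataset[0][0]))

tiny = make_dataset(3, 4, 5)           # miniature stand-in for the real array
print(shape(tiny))

n_values = N_IMAGES * HEIGHT * WIDTH   # total number of pixel values
size_mib = n_values / 2**20            # assuming 1 byte per pixel
print(n_values, round(size_mib, 1))
```

In practice an array library would hold this data far more compactly than nested lists; the nested list here only mirrors the indexing order.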

Visualization of our pixel array using matplotlib


Similarly, we create the label dataset by assigning the integer for each class (shown below)
to the corresponding index in the dataset. If the 5th (index) item in the dataset (X) is a
locating pin, the 5th item in the label set (Y) will have the value 0.
#integers and the corresponding classes as already mentioned above
{0: 'locatingpin', 1: 'washer', 2: 'bolt', 3: 'nut'}
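
The label set can be built by inverting the mapping above and looking up each sample's class. This is a sketch only: the six-sample ordering is made up for illustration, and `build_labels` is a hypothetical helper name.

```python
CLASS_NAMES = {0: 'locatingpin', 1: 'washer', 2: 'bolt', 3: 'nut'}
CLASS_TO_INDEX = {name: index for index, name in CLASS_NAMES.items()}

def build_labels(sample_classes):
    """Map each sample's class name to its integer label, preserving order."""
    return [CLASS_TO_INDEX[name] for name in sample_classes]

# Hypothetical ordering of six samples; the real dataset has 7616 entries.
sample_classes = ['nut', 'bolt', 'washer', 'bolt', 'nut', 'locatingpin']
Y = build_labels(sample_classes)
print(Y)  # prints [3, 2, 1, 2, 3, 0]
```

Here the sample at index 5 is a locating pin, so `Y[5]` is 0, matching the example in the text.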
Model Selection
Since this is an image recognition problem, we will be using a Convolutional Neural Network
(CNN). A CNN is a type of neural network that handles image data especially well. A neural
network is a type of machine learning algorithm whose structure is loosely inspired by the way
neurons are connected in the human brain.


A basic neural network.

A Convolutional Neural network. A basic visualization of how our algorithm will work
The following is a summary of our CNN. Don’t worry if you don’t understand it.
The idea is that the 224 x 224 pixel values (features) of each image go through this network,
which outputs a predicted class. The model adjusts its weights accordingly and, after many
iterations, is able to predict the class of an arbitrary image.
#Model description
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
conv2d_1 (Conv2D)            (None, 222, 222, 128)     1280
_________________________________________________________________
activation_1 (Activation)    (None, 222, 222, 128)     0
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 111, 111, 128)     0
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 109, 109, 128)     147584
_________________________________________________________________
activation_2 (Activation)    (None, 109, 109, 128)     0
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 54, 54, 128)       0
_________________________________________________________________
flatten_1 (Flatten)          (None, 373248)            0
_________________________________________________________________
dense_1 (Dense)              (None, 64)                23887936
_________________________________________________________________
dense_2 (Dense)              (None, 4)                 260
_________________________________________________________________
activation_3 (Activation)    (None, 4)                 0
=================================================================
Total params: 24,037,060
Trainable params: 24,037,060
Non-trainable params: 0
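
The output shapes and parameter counts in the summary can be reproduced by hand. The sketch below assumes 3 x 3 kernels with stride 1 and no padding, 2 x 2 max pooling, and a single grayscale input channel; these settings are not stated in the source but are the ones consistent with the numbers shown.

```python
def conv2d_params(kernel, in_channels, filters):
    """Weights plus one bias per filter for a square-kernel Conv2D layer."""
    return (kernel * kernel * in_channels + 1) * filters

def conv_output_size(size, kernel):
    """Spatial size after a stride-1 convolution without padding."""
    return size - kernel + 1

def pool_output_size(size, pool):
    """Spatial size after non-overlapping max pooling (floor division)."""
    return size // pool

def dense_params(inputs, units):
    """Weights plus one bias per unit for a fully connected layer."""
    return (inputs + 1) * units

size = 224
conv1 = conv2d_params(3, 1, 128)
size = pool_output_size(conv_output_size(size, 3), 2)  # 224 -> 222 -> 111
conv2 = conv2d_params(3, 128, 128)
size = pool_output_size(conv_output_size(size, 3), 2)  # 111 -> 109 -> 54
flat = size * size * 128                                # Flatten output length
dense1 = dense_params(flat, 64)
dense2 = dense_params(64, 4)
total = conv1 + conv2 + dense1 + dense2

print(conv1, conv2, flat, dense1, dense2, total)
```

Almost all of the roughly 24 million parameters sit in the first Dense layer, because it connects every one of the 373248 flattened values to each of its 64 units.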
Model Training
Now, finally, the time has come to train the model using our dataset of 7616 images. Our
[X] is a 3D array of 7616 x 224 x 224 and our [y] label set is a 7616 x 1 array. For training
purposes the data must be split into at least two parts: a training set and a validation (test)
set (test and validation are used interchangeably when only 2 sets are involved).

Data being split into training and test set.


The training set is the data the model sees and trains on. It is the data from which it adjusts
its weights and learns. The accuracy of our model on this set is the training accuracy. It is
generally higher than the validation accuracy.


The validation data usually comes from the same distribution as the training set and is
data the model has not seen. After the model has trained on the training set, it tries to
predict the classes of the validation set. How accurately it predicts these is our validation
accuracy. This is more important than the training accuracy because it shows how well the
model generalizes. In real-life applications it is common to split the data into three parts:
training, validation and test. For our case we will only split it into a training and a test set.
It will be an 80–20 split: 80% of the images are used for training and 20% for testing.
That is, we train on 6092 samples and test on 1524 samples out of the total 7616.
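
A minimal sketch of an 80–20 split over sample indices, using only the standard library. Shuffling before splitting and rounding the training count down are assumptions; rounding down is what reproduces the 6092 / 1524 figures quoted above.

```python
import random

def train_test_split_indices(n_samples, train_fraction=0.8, seed=0):
    """Shuffle sample indices and split them into train and test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = int(n_samples * train_fraction)  # rounds down
    return indices[:n_train], indices[n_train:]

train_idx, test_idx = train_test_split_indices(7616)
print(len(train_idx), len(test_idx))  # prints 6092 1524
```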

Visualization of the model training.


For our model we trained for 15 epochs with a batch size of 64.
The number of epochs is a hyperparameter that defines the number of times that the learning
algorithm will work through the entire training dataset.
One epoch means that each sample in the training dataset has had an opportunity to update
the internal model parameters. An epoch is comprised of one or more batches.
You can think of a for-loop over the number of epochs where each loop proceeds over the
training dataset. Within this for-loop is another nested for-loop that iterates over each batch
of samples, where one batch has the specified “batch size” number of samples. [2]
That is, our model will go through the 6092 training samples 15 times (epochs) in total,
64 samples (one batch) at a time. It adjusts its weights after every batch, so that its
predictions become more accurate as training proceeds.
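
The nested loop described above can be sketched as follows. It assumes the model iterates over the 6092 training samples from the 80–20 split, and it only counts weight updates; the actual forward pass and weight update are left as a comment.

```python
import math

def iterate_batches(n_samples, batch_size):
    """Yield (start, stop) index pairs covering the data one batch at a time."""
    for start in range(0, n_samples, batch_size):
        yield start, min(start + batch_size, n_samples)

N_TRAIN, BATCH_SIZE, EPOCHS = 6092, 64, 15

updates = 0
for epoch in range(EPOCHS):                       # outer loop: epochs
    for start, stop in iterate_batches(N_TRAIN, BATCH_SIZE):  # inner loop: batches
        # A real training loop would compute the loss on samples
        # [start:stop] and update the weights here.
        updates += 1

batches_per_epoch = math.ceil(N_TRAIN / BATCH_SIZE)
print(batches_per_epoch, updates)  # prints 96 1440
```

Under these assumptions each epoch has 96 batches (the last one holds only 12 samples), giving 1440 weight updates over 15 epochs.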
Evaluate the model
The model keeps updating its weights so as to minimize the cost (loss), which in turn tends
to improve its accuracy. Cost is a measure of the inaccuracy of the model in predicting the
class of the image.
Cost functions are used to estimate how badly models are performing. Put simply, a cost
function is a measure of how wrong the model is in terms of its ability to estimate the
relationship between X and y. [1]


If the algorithm predicts incorrectly, the cost increases; if it predicts correctly, the cost
decreases.
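
The source does not name the cost function that was used. Categorical cross-entropy is a common choice for multi-class classification and is assumed here only to illustrate how the cost reacts to right and wrong predictions; the probability vectors are made up.

```python
import math

def cross_entropy(probabilities, true_class):
    """Cross-entropy loss for one sample: -log of the probability of the true class."""
    return -math.log(probabilities[true_class])

true_class = 2  # 'bolt' in the label mapping
confident_right = [0.02, 0.03, 0.90, 0.05]  # hypothetical model output
confident_wrong = [0.02, 0.90, 0.03, 0.05]  # hypothetical model output

print(round(cross_entropy(confident_right, true_class), 3))
print(round(cross_entropy(confident_wrong, true_class), 3))
```

The confident correct prediction gives a loss close to zero, while the confident wrong prediction gives a much larger one.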
After training for 15 epochs we obtain the following graphs of loss and accuracy. (Cost and
loss can be used interchangeably in our case.)

Graph generated with matplotlib showing training and validation loss for our model
The loss decreased as the model trained for more epochs; the model becomes better at
classifying the images with each epoch. On the validation set, however, the model is not able
to improve its performance much.

Graph generated with matplotlib showing training and validation accuracy for our model
The accuracy increased as the model trained for each epoch; it becomes better at classifying
the images. The accuracy on the validation set is lower than on the training set, as the model
has not trained on it directly. The final value is 97.64%, which is not bad.
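
Accuracy here is simply the fraction of images whose predicted class matches the true class. A minimal sketch with made-up labels (not the article's data):

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Hypothetical labels for eight images, using the 0-3 class integers.
actual = [0, 1, 2, 3, 2, 1, 0, 3]
predicted = [0, 1, 2, 3, 2, 1, 3, 3]

print(f"{accuracy(predicted, actual):.2%}")  # prints 87.50%
```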
Hyperparameter Tuning
The next step would be to change the hyperparameters (the learning rate, number of epochs,
data size, etc.) to improve our model. In machine learning, a hyperparameter is a parameter
whose value is used to control the learning process. By contrast, the values of other
parameters (typically node weights) are derived via training. [3] For our purpose we had
already modified these parameters before this article was written, so as to obtain an
optimum performance for display in this article. We increased the dataset size and number
of epochs to improve the accuracy.
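
The source does not describe how the hyperparameters were searched. One common approach, shown here only as a sketch, is an exhaustive grid search. `toy_evaluate` is a hypothetical stand-in for "train the CNN with this configuration and return the validation accuracy", and the grid values are made up.

```python
from itertools import product

def grid_search(evaluate, grid):
    """Try every combination in the grid and return the best (config, score)."""
    names = list(grid)
    best_score, best_config = float("-inf"), None
    for values in product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        score = evaluate(config)
        if score > best_score:
            best_score, best_config = score, config
    return best_config, best_score

def toy_evaluate(config):
    """Hypothetical scoring function; NOT a real training run."""
    return -abs(config["learning_rate"] - 0.001) * 100 + config["epochs"] / 100

grid = {"learning_rate": [0.01, 0.001, 0.0001], "epochs": [5, 10, 15]}
best_config, best_score = grid_search(toy_evaluate, grid)
print(best_config)
```

In practice each call to the evaluation function means a full training run, which is why the number of combinations is usually kept small.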

Hyperparameters affect parameters and eventually the final score (accuracy)


Make Predictions
The final step after making the adjustments to the model is to make predictions using the
actual data on which the model will be used. If the model does not perform well on this data,
further hyperparameter tuning can commence.
Machine learning is a rather iterative and empirical process, and thus the tuning of
hyperparameters is often compared to an art rather than a science: although we have an
idea of what will happen when certain hyperparameters are changed, we cannot be
certain of it.
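
Turning the network's four output values into a class name can be sketched as below. The summary only lists a final Activation layer; softmax is assumed here because it is the usual choice for multi-class output, and the raw scores are made up.

```python
import math

CLASS_NAMES = {0: 'locatingpin', 1: 'washer', 2: 'bolt', 3: 'nut'}

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_class(scores):
    """Return the most probable class name and its probability."""
    probabilities = softmax(scores)
    index = max(range(len(probabilities)), key=probabilities.__getitem__)
    return CLASS_NAMES[index], probabilities[index]

raw_scores = [0.1, 0.3, 2.5, 0.4]  # hypothetical outputs of the last Dense layer
label, confidence = predict_class(raw_scores)
print(label, round(confidence, 3))
```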

The machine learning algorithm flowchart


Applications
This ability to classify mechanical parts could enable us to recommend parts from a standard
library based only on an image or a CAD model provided by the customer. Currently, to
search for a required part in a standard library you have to go through a catalogue and be
able to tell which part you want based on the available options and your knowledge of the
catalogue. There are serial codes to remember, as a change in a single digit or letter
might mean a different type of part.

Example of a part number


If an image can be used to get the required part from the standard library, all we will need to
do is make a rough CAD model of it and send it through our algorithm. The algorithm will
decide which parts match best and help narrow down our search significantly.

Visualisation of how the recommendation algorithm would work


If the classification method becomes detailed and fine-tuned enough, it should be able to
classify in much finer detail what type of part you want. The narrowed search saves a lot of
time. This is especially useful in a library where there are thousands of similar parts.
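
The narrowing step could look like the sketch below: keep only the catalogue entries whose type is among the classifier's most probable classes. The catalogue, the part numbers and the probabilities are all hypothetical, and `narrow_search` is an illustrative name rather than anything from the source.

```python
# Hypothetical catalogue; real part numbers would come from the standard library.
CATALOGUE = [
    {"part_number": "HYP-B-001", "type": "bolt"},
    {"part_number": "HYP-B-002", "type": "bolt"},
    {"part_number": "HYP-N-001", "type": "nut"},
    {"part_number": "HYP-W-001", "type": "washer"},
    {"part_number": "HYP-P-001", "type": "locatingpin"},
]

def narrow_search(catalogue, class_probabilities, top_k=1):
    """Keep catalogue entries whose type is among the top_k predicted classes."""
    ranked = sorted(class_probabilities, key=class_probabilities.get, reverse=True)
    keep = set(ranked[:top_k])
    return [item for item in catalogue if item["type"] in keep]

# Hypothetical classifier output for one query image.
probabilities = {"locatingpin": 0.02, "washer": 0.05, "bolt": 0.85, "nut": 0.08}
candidates = narrow_search(CATALOGUE, probabilities)
print([item["part_number"] for item in candidates])
```

Raising `top_k` trades a longer candidate list for a lower chance of excluding the right part when the classifier is unsure.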

Topic: Material inspection

11. Explain scope of ML for Materials Science.

Reference: Five High-Impact Research Areas in Machine Learning for Materials Science by
Bryce Meredig. https://pubs.acs.org/doi/10.1021/acs.chemmater.9b04078
Over the past several years, the field of materials informatics has grown dramatically. (1)
Applications of machine learning (ML) and artificial intelligence (AI) to materials science are
now commonplace. As materials informatics has matured from a niche area of research into
an established discipline, distinct frontiers of this discipline have come into focus, and best
practices for applying ML to materials are emerging. (2) The purpose of this editorial is to
outline five broad categories of research that, in my view, represent particularly high-impact
opportunities in materials informatics today:
• Validation by experiment or physics-based simulation. One of the most common
applications of ML in materials science involves training models to predict materials
properties, typically with the goal of discovering new materials. With the availability of
user-friendly, open-source ML packages such as scikit-learn, (3) keras, (4) and pytorch, (5)
the process of training a model on a materials data set—which requires only a few lines
of python code—has become completely commoditized. Thus, standard practice in
designing materials with ML should include some form of validation, ideally by
experiment (6−8) or, in some cases, by physics-based simulation. (9,10) Of particular
interest are cases in which researchers use ML to identify materials whose properties are
superior to those of any material in the initial training set; (11) such extraordinary results
remain scarce.
• ML approaches tailored for materials data and applications. This category
encapsulates a diverse set of method development activities that make ML more
applicable to and effective for a wider range of materials problems. Materials science as a
field is characterized by small, sparse, noisy, multiscale, and heterogeneous
multidimensional (e.g., a blend of scalar property estimates, curves, images, time series,
etc.) data sets. At the same time, we are often interested in exploring very large, high-
dimensional chemistry and processing design spaces. Some method development
examples to address these challenges include new approaches for uncertainty
quantification (UQ), (12) extrapolation detection, (13) multiproperty optimization, (14)
descriptor development (i.e., the design of new materials representations for ML),
(15−17) materials-specific cross-validation, (18,19) ML-oriented data standards, (20,21)
and generative models for materials design. (22)
• High-throughput data acquisition capabilities. ML is notoriously data-hungry. Given
the typically very high cost of acquiring materials data, both in terms of time and money,
the materials informatics field is well-served by research that accelerates and
democratizes our ability to synthesize, characterize, and simulate materials. Examples
include high-throughput density functional theory calculations of materials properties,
(23−25) applications of robotics, automation, and operations research to materials
science, (26−30) and natural language processing (NLP) to extract materials data from
text corpora. (31,32)
• ML that makes us better scientists. A popular refrain in the materials informatics
community is that “ML will not replace scientists, but scientists who use ML will replace
those who do not.” This bon mot suggests that ML has the potential to make scientists
more effective and enable them to do more interesting and impactful work. We are still
in the nascent stages of creating true ML-based copilots for scientists, but research areas
such as ML model explainability and interpretability (33,34) represent a valuable early
step. Another example is the application of ML to accelerate or simplify materials
characterization. Researchers have used deep learning to efficiently post-process and
understand images generated via existing characterization methods such as scanning
transmission electron microscopy (STEM) (35) and position averaged convergent beam
electron diffraction (PACBED). (36)
• Integration of physics within ML, and ML with physics-based simulations. The
paucity of data in many materials applications is a strong motivator for formally
integrating known physics into ML models. One approach to embedding physics within
ML is to develop methods that guarantee certain desirable properties by construction,
such as respecting the invariances present in a physical system. (37) Another strategy is
to use ML to model the difference between simulation outputs and experimental results.
For example, Google and collaborators created TossingBot, a robotic system that learned
to throw objects into bins with the aid of a ballistics simulation. (38) The researchers
found that a physics-aware ML approach, wherein ML learned and corrected for the
discrepancy between the simulations and real-world observations, dramatically
outperformed a pure trial-and-error ML training strategy. In a similar vein, ML can enable
us to derive more value from existing physics-based simulations. For example, ML-based
interatomic potentials (39−41) represent a means of capturing some of the physics of
first-principles simulations in a much more computationally efficient model that can
simulate orders of magnitude more atoms. ML can also serve as “glue” to link physics-based
models operating at various fidelities and length scales. (42)
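
The idea in the last bullet of using ML to model the difference between simulation outputs and experimental results can be illustrated with a toy example. This is not the TossingBot method: a drag-free projectile formula plays the role of the simulation, a least-squares line plays the role of the ML model, and the "measurements" are synthetic data constructed for the example.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2

def simulated_range(speed, angle_deg):
    """Physics-based simulation: drag-free projectile range on flat ground."""
    return speed ** 2 * math.sin(2 * math.radians(angle_deg)) / G

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a * x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x

speeds = [4.0, 5.0, 6.0, 7.0, 8.0]
angle = 45.0
sim = [simulated_range(v, angle) for v in speeds]
# Synthetic "measurements" (made up): throws land 10% short with a 0.05 m offset.
measured = [0.9 * s - 0.05 for s in sim]

# Learn the discrepancy (measurement minus simulation), not the range itself.
residuals = [m - s for m, s in zip(measured, sim)]
a, b = fit_line(sim, residuals)

def corrected_range(speed, angle_deg):
    """Simulation output plus the learned correction."""
    s = simulated_range(speed, angle_deg)
    return s + a * s + b

print(round(a, 3), round(b, 3))
```

The learned model only has to capture the small systematic error of the simulation, which typically needs far less data than learning the full input-output relationship from scratch.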
As ML becomes more widely used in materials research, I expect that efforts addressing one
or more of these five themes will have an outsized impact on both the materials informatics
discipline and materials science more broadly.

12. Explain machine learning for materials design and discovery.

Reference: Vasudevan, R., Pilania, G., & Balachandran, P. V. (2021). Machine learning for
materials design and discovery. Journal of Applied Physics, 129(7), 070401.
https://doi.org/10.1063/5.0043300
Liu, Y., Zhao, T., Ju, W., & Shi, S. (2017). Materials discovery and design using machine
learning. Journal of Materiomics, 3(3), 159-177.
https://www.sciencedirect.com/science/article/pii/S2352847817300515


An overview of the application of machine learning in materials science.

The fundamental framework for the application of machine learning in material property
prediction.


The general process of machine learning in materials science.

Topic: Process optimization

13. Explain process optimization using machine learning.

Refer to the following paper:
https://www.icheme.org/media/12829/matthew-mcewan-a-hands-on-demonstration-of-process-optimisation-using-machine-learning-techniques.pdf

Feel free to contact me on +91-8329347107 calling / +91-9922369797 whatsapp,


email ID: [email protected] and [email protected]

*********************
