Unit 3: Classification & Regression: Question Bank and Its Solution
Theory questions
There are various algorithms in Machine learning, so choosing the best algorithm for the
given dataset and problem is the main point to remember while creating a machine learning
model. Below are the two reasons for using the Decision tree:
Decision Tree is a Supervised learning technique that can be used for both
classification and Regression problems, but mostly it is preferred for solving
Classification problems.
It is a tree-structured classifier, where internal nodes represent the features of a
dataset, branches represent the decision rules and each leaf node represents the
outcome.
In a Decision tree, there are two nodes, which are the Decision Node and Leaf Node.
Decision nodes are used to make any decision and have multiple branches, whereas Leaf
nodes are the output of those decisions and do not contain any further branches.
The decisions or the test are performed on the basis of features of the given dataset.
It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
A decision tree simply asks a question, and based on the answer (Yes/No), it further splits the tree into sub-trees. The below diagram explains the general structure of a decision tree. A decision tree comprises a root node, leaf nodes, branches, parent/child nodes, etc. The following is an explanation of this terminology.
Root Node: Root node is from where the decision tree starts. It represents the entire
dataset, which further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated
further after getting a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into sub-
nodes according to the given conditions.
Branch/Sub Tree: A tree formed by splitting the tree.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: The root node of the tree is called the parent node, and other
nodes are called the child nodes.
In a decision tree, for predicting the class of the given dataset, the algorithm starts from the
root node of the tree. This algorithm compares the values of root attribute with the record
(real dataset) attribute and, based on the comparison, follows the branch and jumps to the
next node. For the next node, the algorithm again compares the attribute value with the other sub-nodes and moves further. It continues this process until it reaches a leaf node of the tree.
Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM), i.e. information gain or the Gini index.
Step-3: Divide S into subsets that contain the possible values of the best attribute.
Step-4: Generate the decision tree node, which contains the best attribute.
Step-5: Recursively make new decision trees using the subsets of the dataset created in Step 3. Continue this process until a stage is reached where you cannot further classify the nodes; the final node is then called a leaf node.
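As a rough illustration of these steps, a minimal scikit-learn sketch is shown below. The iris dataset is only a stand-in for the dataset S, and the Attribute Selection Measure is chosen via the criterion parameter.

```python
# A minimal sketch of the tree-building steps above using scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# criterion selects the ASM: "entropy" (information gain) or "gini"
tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(X_train, y_train)                     # Steps 1-5 happen inside fit()

print(export_text(tree, feature_names=load_iris().feature_names))
print("Test accuracy:", tree.score(X_test, y_test))
```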
Example: Suppose there is a candidate who has a job offer and wants to decide whether he should accept the offer or not. So, to solve this problem, the decision tree starts with the root node (the Salary attribute, selected by ASM).
The root node splits further into the next decision node (distance from the office) and one
leaf node based on the corresponding labels. The next decision node further gets split into
one decision node (Cab facility) and one leaf node. Finally, the decision node splits into two
leaf nodes (Accepted offers and Declined offer). See the above figure.
The general idea is that we will segment the predictor space into a number of simple
regions. In order to make a prediction for a given observation, we typically use the mean of
the training data in the region to which it belongs. Since the set of splitting rules used to
segment the predictor space can be summarized by a tree such approaches are called
decision tree methods. These methods are simple and useful for interpretation. We want to
predict a response or class Y from inputs X1,X2, . . .Xp. We do this by growing a binary tree.
At each internal node in the tree, we apply a test to one of the inputs, say Xi . Depending on
the outcome of the test, we go to either the left or the right sub-branch of the tree. Eventually
we come to a leaf node, where we make a prediction. This prediction aggregates or
averages all the training data points which reach that leaf. In order to motivate regression
trees, we begin with a simple example. Our motivation is to predict a baseball player’s Salary
based on Years (the number of years that he has played in the major leagues) and Hits (the
number of hits that he made in the previous year). We first remove observations that are
missing Salary values and log-transform Salary so that its distribution has more of a typical
bell-shape. Recall that Salary is measured in thousands of dollars.
The tree represents a series of splits starting at the top of the tree. The top split assigns
observations having Years < 4.5 to the left branch. The predicted salary for these players is
given by the mean response value for the players in the data set with Years < 4.5. For such players, the mean log salary is 5.107, and so we make a prediction of e^5.107 thousand dollars, i.e. about $165,174. How would you interpret the rest (right branch) of the tree?
In keeping with the tree analogy, the regions R1, R2, and R3 are known as terminal
nodes or leaves of the tree.
As is the case for Figure 2, decision trees are typically drawn upside down, in the sense
that the leaves are at the bottom of the tree.
The points along the tree where the predictor space is split are referred to as internal
nodes.
In Figure 2, the two internal nodes are indicated by the text Years < 4.5 and Hits < 117.5.
We refer to the segments of the trees that connect the nodes as branches.
Years is the most important factor in determining Salary, and players with less experience
earn lower salaries than more experienced players.
Given that a player is less experienced, the number of hits that he made in the previous
year seems to play little role in his salary.
But among players who have been in the major leagues for five or more years, the
number of hits made in the previous year does affect salary, and players who made more
hits last year tend to have higher salaries.
The regression tree shown in Figure 2 is likely an over-simplification of the true relationship between Hits, Years, and Salary, but it provides a very nice, easy interpretation compared with more complicated approaches.
5. Explain entropy reduction, information gain and Gini index in decision tree.
While implementing a Decision tree, the main issue is how to select the best attribute for the root node and for the sub-nodes. To solve such problems there is a technique called the Attribute Selection Measure, or ASM. With this measure, we can easily select the best attribute for the nodes of the tree. There are two popular techniques for ASM, which are:
Information Gain:
Information gain is the measurement of changes in entropy after the segmentation of a
dataset based on an attribute.
It calculates how much information a feature provides us about a class.
According to the value of information gain, we split the node and build the decision
tree. A decision tree algorithm always tries to maximize the value of information gain,
and a node/attribute having the highest information gain is split first. It can be
calculated using the below formula:
Information Gain= Entropy(S) – [(Weighted Average) * Entropy (each feature)]
Entropy:
Entropy is a metric to measure the impurity in a given attribute. It specifies randomness in
data. Entropy can be calculated as:
Entropy(S) = −P(yes) log2 P(yes) − P(no) log2 P(no)
where S is the set of samples, P(yes) is the probability of yes, and P(no) is the probability of no.
Gini Index:
Gini index is a measure of impurity or purity used while creating a decision tree in the
CART (Classification and Regression Tree) algorithm.
An attribute with a low Gini index should be preferred over one with a high Gini index. The Gini index can be calculated using the formula: Gini Index = 1 − Σj (Pj)²
It only creates binary splits, and the CART algorithm uses the Gini index to create binary
splits.
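The three measures above can be written directly in a few lines of Python; the snippet below is a sketch using NumPy, with a tiny made-up label set as an example.

```python
import numpy as np

def entropy(labels):
    """Entropy(S) = -sum_i p_i * log2(p_i)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    """Gini = 1 - sum_j p_j^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(parent_labels, child_label_groups):
    """Entropy(S) minus the weighted average entropy of the child subsets."""
    n = len(parent_labels)
    weighted = sum(len(c) / n * entropy(c) for c in child_label_groups)
    return entropy(parent_labels) - weighted

# Example: a split that separates the classes perfectly has maximal gain.
parent = ["yes"] * 5 + ["no"] * 5
print(entropy(parent), gini(parent))                         # 1.0, 0.5
print(information_gain(parent, [["yes"] * 5, ["no"] * 5]))   # 1.0
```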
7. Many times while training, a decision tree tends to overfit. What is the reason?
A decision tree tends to overfit since at each node it makes a decision among a subset of all the features (columns), so by the time it reaches a final decision it has built a complicated and long decision chain. Only if a data point satisfies all the rules along this chain can the final decision be made. This kind of specific rule on the training dataset makes the tree very specific to the training set and, on the other hand, unable to generalize well to new data points it has never seen. Especially when your dataset has many features (high dimension), it tends to overfit more. In the J48 decision tree, overfitting happens when the algorithm gets information with exceptional attributes. This causes many fragmentations in the process distribution. Statistically unimportant nodes with very few examples are known as fragmentations. Usually the J48 algorithm builds trees and grows its branches 'just deep enough to perfectly classify the training examples'. This approach performs better with noise-free data, but most of the time this strategy overfits the training examples when the data is noisy. At present there are two strategies widely used to bypass this overfitting in decision tree learning: 1) stop the tree from growing before it reaches the point of perfectly classifying the training data; 2) let the tree overfit the training data and then post-prune the tree. By default, the decision tree model is allowed to grow to its full depth.
post-prune tree. By default, the decision tree model is allowed to grow to its full depth.
Pruning refers to a technique to remove the parts of the decision tree to prevent growing to
its full depth. By tuning the hyperparameters of the decision tree model one can prune the
trees and prevent them from overfitting. There are two types of pruning Pre-pruning and
Post-pruning. Now let's discuss the in-depth understanding and hands-on implementation
of each of these pruning techniques.
Pre-Pruning:
The pre-pruning technique refers to the early stopping of the growth of the decision tree.
The pre-pruning technique involves tuning the hyperparameters of the decision tree model
prior to the training pipeline. The hyperparameters of the decision tree including
max_depth, min_samples_leaf, min_samples_split can be tuned to early stop the growth
of the tree and prevent the model from overfitting.
Post-Pruning:
The Post-pruning technique allows the decision tree model to grow to its full depth, then
removes the tree branches to prevent the model from overfitting. Cost complexity pruning
(ccp) is one type of post-pruning technique. In case of cost complexity pruning, the
ccp_alpha can be tuned to get the best fit model.
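A hedged sketch of both ideas with scikit-learn follows; the dataset and the hyperparameter values are illustrative only. Recent scikit-learn versions expose ccp_alpha and cost_complexity_pruning_path for post-pruning.

```python
# Sketch of pre-pruning vs. post-pruning with scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: stop growth early via hyperparameters.
pre = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5,
                             min_samples_split=10, random_state=0)
pre.fit(X_train, y_train)

# Post-pruning: grow fully, then prune with cost-complexity pruning (ccp_alpha).
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
post = DecisionTreeClassifier(ccp_alpha=path.ccp_alphas[-2], random_state=0)  # one candidate alpha
post.fit(X_train, y_train)

print("pre-pruned accuracy :", pre.score(X_test, y_test))
print("post-pruned accuracy:", post.score(X_test, y_test))
```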
Problems/Numerical
Problem 1:
If we decided to arbitrarily label all 4 gumballs as red, how often would one of the gumballs be incorrectly labelled?
4 red and 0 blue:
The impurity measurement is 0 because we would never incorrectly label any of the 4 red gumballs here. If we arbitrarily chose to label all the balls 'blue' instead, the index would still be 0, because the Gini score depends only on the class proportions within the group (here 1 and 0), not on which label we pick.
The Gini score is always the same no matter which arbitrary class you take the probabilities of, because the probabilities always add to 1 in the formula above.
A gini score of 0 is the most pure score possible.
2 red and 2 blue:
The impurity measurement is 0.5 because we would incorrectly label gumballs about half the time. Because this index is used with binary target variables (0, 1), a Gini index of 0.5 is
the least pure score possible. Half is one type and half is the other. Dividing gini scores by
0.5 can help intuitively understand what the score represents. 0.5/0.5 = 1, meaning the
grouping is as impure as possible (in a group with just 2 outcomes).
3 red and 1 blue:
The impurity measurement here is 1 − (0.75² + 0.25²) = 0.375. If we divide this by 0.5 for a more intuitive understanding we get 0.75, meaning this grouping is 75% as impure as the most impure (50/50) grouping.
Problem 2:
How does entropy work with the same gumball scenarios stated in problem 1?
4 red and 0 blue:
Unsurprisingly, the impurity measurement is 0 for entropy as well. This is the max purity
score using information entropy.
2 red and 2 blue:
The impurity measurement is 1.0 here, the maximum possible entropy for a two-class split (a higher number than the Gini score of 0.5 for the same split). For 3 red and 1 blue, the entropy works out to 0.811.
Problem 3:
Calculate entropy for following example.
For the set X = {a,a,a,b,b,b,b,b}:
Total instances: 8
Instances of a: 3, so P(a) = 3/8
Instances of b: 5, so P(b) = 5/8
Entropy(X) = −(3/8) log2(3/8) − (5/8) log2(5/8) ≈ 0.954
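A quick numeric check of this calculation in Python:

```python
# Entropy of X = {a,a,a,b,b,b,b,b}.
import math

p_a, p_b = 3 / 8, 5 / 8
H = -(p_a * math.log2(p_a) + p_b * math.log2(p_b))
print(round(H, 3))   # 0.954
```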
Problem 6:
Consider the training examples shown in Table below for a binary classification problem.
Problem 7:
Consider the training examples shown in Table below for a binary classification problem.
Theory questions
Random Forest is a popular machine learning algorithm that belongs to the supervised
learning technique. It can be used for both Classification and Regression problems in ML. It
is based on the concept of ensemble learning, which is a process of combining multiple
classifiers to solve a complex problem and to improve the performance of the model.
As the name suggests, "Random Forest is a classifier that contains a number of decision
trees on various subsets of the given dataset and takes the average to improve the
predictive accuracy of that dataset." Instead of relying on one decision tree, the random forest takes the prediction from each tree and, based on the majority vote of those predictions, predicts the final output. A greater number of trees in the forest leads to higher accuracy and prevents the problem of overfitting.
Assumptions for Random Forest
Since the random forest combines multiple trees to predict the class of the dataset, it is
possible that some decision trees may predict the correct output, while others may not. But
together, all the trees predict the correct output. Therefore, below are two assumptions for a
better Random forest classifier:
There should be some actual values in the feature variable of the dataset so that the
classifier can predict accurate results rather than a guessed result.
The predictions from each tree must have very low correlations.
The below diagram explains the working of the Random Forest algorithm:
Below are some points that explain why we should use the Random Forest algorithm:
It takes less training time as compared to other algorithms.
It predicts output with high accuracy, and even for large datasets it runs efficiently.
It can also maintain accuracy when a large proportion of data is missing.
It can be used for both classification as well as regression tasks.
Overfitting is a serious problem that can make results poor, but in the case of the random forest the classifier will not overfit if there are enough trees.
It can be used for categorical values as well.
10. How does the random forest tree work for classification?
Random Forest works in two phases: the first is to create the random forest by combining N decision trees, and the second is to make predictions using each tree created in the first phase.
The Working process can be explained in the below steps and diagram:
Step-1: Select random K data points from the training set.
Step-2: Build the decision trees associated with the selected data points (Subsets).
Step-3: Choose the number N for decision trees that you want to build.
Step-4: Repeat Step 1 & 2.
Step-5: For new data points, find the predictions of each decision tree, and assign the new
data points to the category that wins the majority votes.
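A minimal scikit-learn sketch of this workflow; the dataset and parameter values are illustrative only.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators is the number N of trees; each tree is trained on a bootstrap sample
# and considers a random subset of features at every split (max_features).
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X_train, y_train)

# New data points are assigned to the class that wins the majority vote of the trees.
print(forest.predict(X_test[:5]))
print("Accuracy:", forest.score(X_test, y_test))
```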
The working of the algorithm can be better understood by the below example:
Example: Suppose there is a dataset that contains multiple fruit images. So, this dataset is
given to the Random forest classifier. The dataset is divided into subsets and given to each
decision tree. During the training phase, each decision tree produces a prediction result, and
when a new data point occurs, then based on the majority of results, the Random Forest
classifier predicts the final decision. Consider the below image:
Bagging: Given the training set of N examples, we repeatedly sample subsets of the
training data of size n where n is less than N. Sampling is done at random but with
replacement. This subsampling of a training set is called bootstrap aggregating, or
bagging, for short.
Random subspace method: If each training example has M features, we take a subset of
them of size m < M to train each estimator. So no estimator sees the full training set,
each estimator sees only m features of n training examples.
Training estimators: We create Ntree decision trees, or estimators, and train each one on
a different set of m features and n training examples. The trees are not pruned, as they
would be in the case of training a simple decision tree classifier.
Perform inference by aggregating predictions of estimators: To make a prediction
for a new incoming example, we pass the relevant features of this example to each of the
Ntree estimators. We will obtain Ntree predictions, which we need to combine to produce
the overall prediction of the random forest. In the case of classification, we will use
majority voting to decide on the predicted class, and in the case of regression, we will
take the mean value of the predictions of all the estimators.
12. What are advantages and limitations of the random forest tree?
13. What is the difference between simple decision tree and random forest tree?
Random forests are less prone to overfitting, but they are complex to understand. A decision tree is easy to read and
understand whereas random forest is more complicated to interpret. A single decision tree is
not accurate in predicting the results but is fast to implement. More trees will give a more
robust model and prevents overfitting. In a forest, we need to generate, process and analyze each and every tree. Therefore this is a slow process that can sometimes take hours or even days.
The Ensemble learning helps improve machine learning results by combining several models.
This approach allows the production of better predictive performance compared to a single
model. Basic idea is to learn a set of classifiers (experts) and to allow them to vote. Bagging
and Boosting are two types of Ensemble Learning. These two decrease the variance of a
single estimate as they combine several estimates from different models. So the result may
be a model with higher stability. Let‘s understand these two terms in a glimpse.
1. Bagging: It is a homogeneous weak learners‘ model that learns from each other
independently in parallel and combines them for determining the model average.
2. Boosting: It is also a homogeneous weak learners‘ model but works differently from
Bagging. In this model, learners learn sequentially and adaptively to improve model
predictions of a learning algorithm.
Bagging: Bootstrap Aggregating, also known as bagging, is a machine learning ensemble
meta-algorithm designed to improve the stability and accuracy of machine learning
algorithms used in statistical classification and regression. It decreases the variance and
helps to avoid overfitting. It is usually applied to decision tree methods. Bagging is a special
case of the model averaging approach.
Description of the Technique
Suppose a set D of d tuples; at each iteration i, a training set Di of d tuples is sampled with replacement from D (i.e., a bootstrap sample). Then a classifier model Mi is learned for each training set Di. Each classifier Mi returns its class prediction. The bagged classifier M* counts the votes and assigns the class with the most votes to X (an unknown sample).
Implementation Steps of Bagging
Step 1: Multiple subsets are created from the original data set with equal tuples,
selecting observations with replacement.
Step 2: A base model is created on each of these subsets.
Step 3: Each model is learned in parallel from each training set and independent of each
other.
Step 4: The final predictions are determined by combining the predictions from all the
models.
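These steps map closely onto scikit-learn's BaggingClassifier; the sketch below assumes a recent scikit-learn version (older releases name the base-model parameter base_estimator instead of estimator) and an illustrative dataset.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # Step 2: a base model per subset
    n_estimators=50,                     # number of bootstrap subsets (Step 1)
    bootstrap=True,                      # sample with replacement
    n_jobs=-1,                           # Step 3: models trained in parallel
    random_state=0,
)
bag.fit(X_train, y_train)
print("Bagged accuracy:", bag.score(X_test, y_test))  # Step 4: combined predictions
```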
[Figure: an illustration presenting the intuition behind the boosting algorithm, with its learners and weighted dataset.]
Similarities between Bagging and Boosting
Bagging and Boosting, both being the commonly used methods, have a universal similarity
of being classified as ensemble methods. Here we will explain the similarities between them.
Both are ensemble methods to get N learners from 1 learner.
Both generate several training data sets by random sampling.
Both make the final decision by averaging the N learners (or taking the majority of them
i.e Majority Voting).
Both are good at reducing variance and provide higher stability.
Differences between Bagging and Boosting
1. Bagging is the simplest way of combining predictions that belong to the same type, whereas Boosting is a way of combining predictions that belong to different types.
2. Bagging aims to decrease variance, not bias; Boosting aims to decrease bias, not variance.
3. In Bagging, each model receives equal weight; in Boosting, models are weighted according to their performance.
4. In Bagging, each model is built independently; in Boosting, new models are influenced by the performance of previously built models.
5. In Bagging, different training data subsets are randomly drawn with replacement from the entire training dataset; in Boosting, every new subset contains the elements that were misclassified by previous models.
6. Bagging tries to solve the over-fitting problem; Boosting tries to reduce bias.
7. If the classifier is unstable (high variance), apply Bagging; if the classifier is stable and simple (high bias), apply Boosting.
8. Example: the Random Forest model uses Bagging, while AdaBoost uses Boosting techniques.
There‘s not an outright winner; it depends on the data, the simulation and the
circumstances.
Bagging and Boosting decrease the variance of your single estimate as they combine
several estimates from different models. So the result may be a model with higher
stability.
If the problem is that the single model gets a very low performance, Bagging will rarely
get a better bias. However, Boosting could generate a combined model with lower
errors as it optimises the advantages and reduces pitfalls of the single model.
By contrast, if the difficulty of the single model is over-fitting, then Bagging is the best
option. Boosting for its part doesn‘t help to avoid over-fitting; in fact, this technique is
faced with this problem itself. Thus, Bagging is effective more often than Boosting.
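For a concrete, if simplified, comparison, the sketch below cross-validates one bagging-style ensemble (a random forest) and one boosting ensemble (AdaBoost) on the same sample data; which wins depends entirely on the dataset.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

bagging_style = RandomForestClassifier(n_estimators=100, random_state=0)
boosting = AdaBoostClassifier(n_estimators=100, random_state=0)

print("Random forest (bagging):", cross_val_score(bagging_style, X, y, cv=5).mean())
print("AdaBoost (boosting)    :", cross_val_score(boosting, X, y, cv=5).mean())
```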
16. What are the main advantages of using a random forest versus a single
decision tree?
In an ideal world, we'd like to reduce both bias-related and variance-related errors. This issue
is well-addressed by random forests. A random forest is nothing more than a series of
decision trees with their findings combined into a single final result. They are so powerful
because of their capability to reduce overfitting without massively increasing error due to
bias. Random forests, on the other hand, are a powerful modelling tool that is far more
resilient than a single decision tree. They combine numerous decision trees to reduce
overfitting and bias-related inaccuracy, and hence produce usable results.
Theory questions
18. What are the Pros and Cons of using Naive Bayes?
The main limitation is the requirement that the predictors be independent. In most real-life cases the predictors are dependent, and this hinders the performance of the classifier.
19. How does the Bayes algorithm differ from decision trees?
Decision trees, once tuned and extended, become harder to understand, and it is complex to implement these algorithms. Other
techniques like boosting and random forest decision trees can perform quite well, and
some feel these techniques are essential to get the best performance out of decision
trees. Again this adds more things to understand and use to tune the tree and hence
more things to implement. In the end the more we add to the algorithm the taller the
barrier to using it.
Naive Bayes requires you to build a classification by hand. There's no way to just toss a bunch of tabular data at it and have it pick the best features it will use to classify; picking which features matter is up to you. Decision trees will pick the best features for you from tabular data. If there were a way for Naive Bayes to pick features, you'd be getting close to using the same techniques that make decision trees work like that. Given this, you may need to combine Naive Bayes with other statistical techniques to help guide you towards which features best classify, and that could mean using decision trees. Naive Bayes answers as a probabilistic classifier: there are techniques to adapt it to categorical prediction, but it will answer in terms of probabilities like (A 90%, B 5%, C 2.5%, D 2.5%). Naive Bayes can perform quite well, and it doesn't overfit nearly as much, so there is no need to prune or process the network. That makes it a simpler algorithm to implement. However, it is harder to debug and understand because it's all probabilities getting multiplied thousands of times, so you have to be careful to test that it's doing what you expect. Naive Bayes does quite well when the training data doesn't contain all possibilities, so it can be very good with low amounts of data. Decision trees work better with lots of data compared to Naive Bayes.
Naive Bayes is used a lot in robotics and computer vision, and does quite well with those tasks. Decision trees perform very poorly in those situations. Teaching a decision tree to recognize poker hands by looking at millions of poker hands does very poorly, because royal flushes and quads occur so rarely that they often get pruned out. If they are pruned out of the resulting tree, it will misclassify those important hands (recall the tall-trees discussion from above). Now just think if you are trying to diagnose cancer using this: cancer doesn't occur in the population in large amounts, and it is more likely to get pruned out. The good news is that this can be handled by using weights, so we weight a winning hand or having cancer higher than a losing hand or not having cancer, and that boosts it up the tree so it won't get pruned out. Again, this is the part of tuning the resulting tree to the situation that I discussed earlier.
Decision trees are neat because they tell you what inputs are the best predicators of the
outputs so often decision trees can guide you to find if there is a statistical relationship
between a given input to the output and how strong that relationship is. Often the
resulting decision tree is less important than the relationships it describes. So decision trees can be used as a research tool as you learn about your data, so that you can build other classifiers.
Working of Naïve Bayes' Classifier can be understood with the help of the below example.
Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using this dataset we need to decide whether we should play or not on a particular day according to the weather conditions. To solve this problem, we need to follow the below steps: 1. Convert the given dataset into frequency tables. 2. Generate a likelihood table by finding the probabilities of the given features. 3. Use Bayes theorem to calculate the posterior probability. Problem: If the weather is sunny, should the player play or not? Solution: To solve this, first consider the below dataset:
SN Outlook Play SN Outlook Play SN Outlook Play SN Outlook Play
0 Rainy Yes 4 Sunny No 8 Rainy No 12 Overcast Yes
1 Sunny Yes 5 Rainy Yes 9 Sunny No 13 Overcast Yes
2 Overcast Yes 6 Sunny Yes 10 Sunny Yes
3 Overcast Yes 7 Overcast Yes 11 Rainy No
Frequency table for the Weather Conditions:
Weather Yes No
Overcast 5 0
Rainy 2 2
Sunny 3 2
Total 10 4
Applying Bayes' theorem: P(Yes|Sunny) = P(Sunny|Yes)*P(Yes)/P(Sunny)
P(Sunny|Yes) = 3/10 = 0.3
P(Yes) = 10/14 = 0.71
P(Sunny) = 5/14 = 0.35
So P(Yes|Sunny) = 0.3*0.71/0.35 = 0.60
Similarly, P(No|Sunny) = P(Sunny|No)*P(No)/P(Sunny)
P(Sunny|No) = 2/4 = 0.5
P(No) = 4/14 = 0.29
So P(No|Sunny) = 0.5*0.29/0.35 = 0.41
So as we can see from the above calculation that P(Yes|Sunny)>P(No|Sunny)
Hence on a Sunny day, Player can play the game.
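The same calculation can be reproduced programmatically from the frequency table; the short sketch below mirrors the numbers above.

```python
# Reproducing the P(Yes|Sunny) vs P(No|Sunny) calculation from the frequency table.
counts = {"Overcast": {"Yes": 5, "No": 0},
          "Rainy":    {"Yes": 2, "No": 2},
          "Sunny":    {"Yes": 3, "No": 2}}

total = sum(c["Yes"] + c["No"] for c in counts.values())      # 14
p_yes = sum(c["Yes"] for c in counts.values()) / total        # 10/14
p_no = 1 - p_yes                                              # 4/14
p_sunny = sum(counts["Sunny"].values()) / total               # 5/14

p_sunny_given_yes = counts["Sunny"]["Yes"] / (p_yes * total)  # 3/10 = 0.30
p_sunny_given_no = counts["Sunny"]["No"] / (p_no * total)     # 2/4  = 0.50

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny       # ~0.60
p_no_given_sunny = p_sunny_given_no * p_no / p_sunny          # ~0.40
print(p_yes_given_sunny, p_no_given_sunny)                    # Yes wins -> play
```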
Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine the
probability of a hypothesis with prior knowledge. It depends on the conditional probability.
Using Bayes theorem, we can find the probability of A happening, given that B has occurred.
Here, B is the evidence and A is the hypothesis. The assumption made here is that the
predictors/features are independent. That is presence of one particular feature does not
affect the other. Hence it is called naive.
Bayes' theorem is stated as: P(A|B) = P(B|A) * P(A) / P(B)
Where,
P(A|B) is the Posterior probability: the probability of hypothesis A given the observed evidence B.
P(B|A) is the Likelihood: the probability of the evidence B given that hypothesis A is true.
P(A) is the Prior probability: the probability of the hypothesis before observing the evidence.
P(B) is the Marginal probability: the probability of the evidence.
Problems/Numerical
Problem 1:
Consider a car theft example. The attributes are Colour, Type, and Origin, and the subject, Stolen, can be either Yes or No. Use the Naive Bayes Classifier to classify a "Red Domestic SUV".
Dataset is as below.
Solution:
Note there is no example of a Red Domestic SUV in our data set. We need to calculate the
probabilities P(Red|Yes), P(SUV|Yes), P(Domestic|Yes) , P(Red|No) , P(SUV|No), and P(Domestic|No) and
multiply them by P(Yes) and P(No) respectively .
Problem 2:
Problem 3:
Theory questions
Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms,
which is used for Classification as well as Regression problems. However, primarily, it is used
for Classification problems in Machine Learning. The goal of the SVM algorithm is to create
the best line or decision boundary that can segregate n-dimensional space into classes so
that we can easily put the new data point in the correct category in the future. This best
decision boundary is called a hyperplane. SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed Support Vector Machine. Consider the below diagram, in which there are two different categories that are classified using a decision boundary or hyperplane:
SVM works by mapping data to a high-dimensional feature space so that data points can be
categorized, even when the data are not otherwise linearly separable.
[Figure: the original dataset, the data with a separator added, and the transformed data.]
A separator between the categories is found, and then the data are transformed in such a
way that the separator could be drawn as a hyperplane. Following this, characteristics of new
data can be used to predict the group to which a new record should belong. For example,
consider the following figure, in which the data points fall into two different categories. The
two categories can be separated with a curve, as shown in the figure. After the
transformation, the boundary between the two categories can be defined by a hyperplane,
as shown in the following figure.
The mathematical function used for the transformation is known as the kernel function.
Following are the popular functions.
Linear
Polynomial
Radial basis function (RBF)
Sigmoid
A linear kernel function is recommended when linear separation of the data is
straightforward. In other cases, one of the other functions should be used. You will need to
experiment with the different functions to obtain the best model in each case, as they each
use different algorithms and parameters.
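A small sketch of trying each of these kernels with scikit-learn's SVC; the dataset is only an example, and scaling is added because SVMs are sensitive to feature scale.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{kernel:8s} kernel: mean CV accuracy = {score:.3f}")
```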
26. Explain Hard and soft margins with the help of sketch
The distance of the vectors from the hyperplane is called the margin which is a separation of
a line to the closest class points. We would like to choose a hyperplane that maximizes the
margin between classes. The graph below shows what good margins and bad margins are.
Again Margin can be sub-divided into,
1. Soft Margin – As most of the real-world data are not fully linearly separable, we will
allow some margin violation to occur which is called soft margin classification. It is better
to have a large margin, even though some constraints are violated. Margin violation
means choosing a hyperplane, which can allow some data points to stay on either the
incorrect side of the hyperplane and between the margin and correct side of the
hyperplane.
2. Hard Margin – If the training data is linearly separable, we can select two parallel
hyperplanes that separate the two classes of data, so that the distance between them is
as large as possible.
Support Vector Machines are part of the supervised learning model with an associated
learning algorithm. It is the most powerful and flexible algorithm used for classification,
regression, and detection of outliers. It is used in case of high dimension spaces; where each
data item is plotted as a point in n-dimension space such that each feature value
corresponds to the value of specific coordinate. The classification is made on the basis of a
hyperplane/line as wide as possible, which distinguishes between two categories more
clearly. Basically, support vectors are the observational points of each individual, whereas the
support vector machine is the boundary that differentiates one class from another class.
Some significant terminology of SVM is given below:
Support Vectors: These are the data point or the feature vectors lying nearby to the
hyperplane. These help in defining the separating line.
Hyperplane: It is a subspace whose dimension is one less than that of a decision plane.
It is used to separate different objects into their distinct categories. The best hyperplane
is the one with the maximum separation distance between the two classes.
Margins: It is defined as the (perpendicular) distance from the data point to the decision boundary. There are two types of margins: good margins and bad margins. A good margin is one where this distance is large, and a bad margin is one where it is small.
The main goal of SVM is to find the maximum marginal hyperplane, so as to segregate the
dataset into distinct classes. It undergoes the following steps:
Firstly the SVM will produce the hyperplanes repeatedly, which will separate out the class
in the best suitable way.
Then we will look for the best option that will help in correct segregation.
Linear SVM:
The working of the SVM algorithm can be understood by using an example. Suppose we
have a dataset that has two tags (green and blue), and the dataset has two features x1 and
x2. We want a classifier that can classify the pair(x1, x2) of coordinates in either green or
blue. Consider the below image. Since this is a 2-d space, by just using a straight line we can easily separate these two classes. But there can be multiple lines that can separate these classes; consider the below image. Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or region is called a hyperplane. The SVM algorithm finds the closest points of the lines from both classes.
These points are called support vectors. The distance between the vectors and the
hyperplane is called as margin. And the goal of SVM is to maximize this margin. The
hyperplane with maximum margin is called the optimal hyperplane.
Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for non-linear
data, we cannot draw a single straight line. Consider the below image:
So to separate these data points, we need to add one more dimension. For linear data, we
have used two dimensions x and y, so for non-linear data, we will add a third dimension z. It
can be calculated as:
z = x² + y²
By adding the third dimension, the sample space will become as below image:
So now, SVM will divide the datasets into classes in the following way. Consider the below
image:
Since we are in 3-d space, it looks like a plane parallel to the x-axis. If we convert it to 2-d space with z = 1, then it will become as:
Hence we get a circumference of radius 1 in the case of non-linear data.
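The idea can be sketched in code: synthetic circular data is not linearly separable in 2-d, but becomes separable after adding z = x² + y², and an RBF kernel achieves the same effect implicitly.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Explicitly add the new dimension z = x1^2 + x2^2.
z = (X[:, 0] ** 2 + X[:, 1] ** 2).reshape(-1, 1)
X3d = np.hstack([X, z])

linear_2d = SVC(kernel="linear").fit(X, y)     # struggles on the raw circular data
linear_3d = SVC(kernel="linear").fit(X3d, y)   # separable with a plane in 3-D
rbf_2d = SVC(kernel="rbf").fit(X, y)           # the kernel trick does this implicitly

print("linear, 2-D:", linear_2d.score(X, y))
print("linear, 3-D:", linear_3d.score(X3d, y))
print("RBF,    2-D:", rbf_2d.score(X, y))
```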
29. What are advantages and limitations of the Support Vector Machine
Advantages
SVMs are very good when we have no prior knowledge of the data.
They work well even with unstructured and semi-structured data like text, images and trees.
The kernel trick is the real strength of SVM. With an appropriate kernel function, we can solve many complex problems.
Unlike neural networks, SVM is not trapped in local optima.
It scales relatively well to high-dimensional data; SVM is more effective in high-dimensional spaces.
SVM models generalize well in practice; the risk of over-fitting is less in SVM.
SVM works relatively well when there is a clear margin of separation between classes.
SVM is effective in cases where the number of dimensions is greater than the number of samples.
SVM is relatively memory efficient.
Disadvantages
Choosing a "good" kernel function is not easy.
SVM algorithm is not suitable for large data sets.
Long training time for large datasets.
Difficult to understand and interpret the final model, variable weights and individual
impact.
Since the final model is not so easy to see, we cannot make small calibrations to the model, hence it is tough to incorporate our business logic.
The SVM hyperparameters are the cost (C) and gamma. It is not that easy to fine-tune these hyperparameters, and it is hard to visualize their impact.
SVM does not perform very well when the data set has more noise i.e. target classes are
overlapping.
In cases where the number of features for each data point exceeds the number of
training data samples, the SVM will underperform.
As the support vector classifier works by putting data points, above and below the
classifying hyperplane there is no probabilistic explanation for the classification.
Two different examples of this approach are the One-vs-Rest and One-vs-One strategies.
Binary classification models like logistic regression and SVM do not support multi-class
classification natively and require meta-strategies.
The One-vs-Rest strategy splits a multi-class classification into one binary classification
problem per class.
The One-vs-One strategy splits a multi-class classification into one binary classification
problem per each pair of classes.
The hyperparameters of SVM are the kernel, regularization (C), gamma and margin.
Kernel: The learning of the hyperplane in linear SVM is done by transforming the problem using some linear algebra. This is where the kernel plays its role.
For a linear kernel, the prediction for a new input x is calculated using the dot product between the input and each support vector xi as follows:
f(x) = B0 + sum(ai * (x · xi))
This equation involves calculating the inner products of a new input vector (x) with all support vectors in the training data. The coefficients B0 and ai (one for each training input) must be estimated from the training data by the learning algorithm.
The polynomial kernel can be written as K(x, xi) = (1 + sum(x * xi))^d, and the exponential (RBF) kernel as K(x, xi) = exp(-gamma * sum((x − xi)²)).
Polynomial and exponential kernels compute the separating line in a higher dimension. This is called the kernel trick.
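A small numeric check of the kernel trick for a degree-2 polynomial kernel: the kernel value computed in the original 2-d space equals a dot product in an explicit 6-dimensional feature space (the feature map below is the standard one for this kernel).

```python
import numpy as np

def phi(v):
    """Explicit degree-2 polynomial feature map for a 2-D input vector."""
    x1, x2 = v
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

x = np.array([2.0, 3.0])
y = np.array([1.0, 2.0])

kernel_value = (1 + x @ y) ** 2      # computed directly in the original 2-D space
explicit_value = phi(x) @ phi(y)     # ordinary dot product in the 6-D feature space
print(kernel_value, explicit_value)  # both 81.0: same result, no explicit mapping needed
```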
Regularization: The Regularization parameter (often termed as C parameter in python‘s
sklearn library) tells the SVM optimization how much you want to avoid misclassifying each
training example. For large values of C, the optimization will choose a smaller-margin
hyperplane if that hyperplane does a better job of getting all the training points classified
correctly. Conversely, a very small value of C will cause the optimizer to look for a larger-
margin separating hyperplane, even if that hyperplane misclassifies more points. The images
below are example of two different regularization parameters. Left one has some
misclassification due to lower regularization value. Higher value leads to results like right
one.
Gamma: The gamma parameter defines how far the influence of a single training example reaches. A high gamma value means only points close to the plausible separating line are considered, whereas a low gamma value means points far away from the plausible separating line are also considered when computing it. [Figures: decision boundaries obtained with high gamma and with low gamma.]
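An illustrative sweep over C and gamma with scikit-learn's RBF-kernel SVC; the dataset and the grid of values are placeholders, not recommendations.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

for C in [0.1, 1, 100]:
    for gamma in [0.001, 0.01, 0.1]:
        model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=C, gamma=gamma))
        score = cross_val_score(model, X, y, cv=5).mean()
        print(f"C={C:<5} gamma={gamma:<6} mean CV accuracy={score:.3f}")
```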
Margin: Finally, a last but very important characteristic of the SVM classifier: SVM at its core tries to achieve a good margin.
A margin is the separation of the line from the closest class points.
A good margin is one where this separation is large for both classes. The images below give a visual example of good and bad margins. A good margin allows the points to be in their respective classes without crossing over to the other class.
Problems/Numerical
In Support Vector Machine, there is the word vector. That means it is important to understand vectors well and how to use them.
What is a vector?
o its norm
o its direction
How to add and subtract vectors?
What is the dot product?
How to project a vector onto another?
What is the equation of the hyperplane?
How to compute the margin?
What is a vector?
If we define a point A(3,4) in ℝ², we can plot it like this.
Definition: Any point x = (x1, x2), x ≠ 0, in ℝ² specifies a vector in the plane, namely the vector starting at the origin and ending at x.
This definition means that there exists a vector between the origin and A.
If we say that the point at the origin is the point O(0,0), then the vector above is the vector OA (written with an arrow on top). We could also give it an arbitrary name such as u.
Note:
You can notice that we write vectors either with an arrow on top of them, or in bold; in the rest of this text the arrow is used when there are two letters, like OA, and the bold notation otherwise.
Ok so now we know that there is a vector, but we still don't know what IS a vector.
Definition: A vector is an object that has both a magnitude and a direction.
We will now look at these two concepts.
1) The magnitude
The magnitude or length of a vector x is written ∥x∥ and is called its norm.
For our vector OA, ∥OA∥ is the length of the segment OA.
From the figure, we can easily calculate the distance OA using Pythagoras' theorem:
OA² = OB² + AB² = 3² + 4² = 25, so OA = 5 and ∥OA∥ = 5.
2) The direction
The direction is the second component of a vector.
Definition: The direction of a vector u(u1, u2) is the vector w(u1/∥u∥, u2/∥u∥).
Looking at the figure, cos(θ) = u1/∥u∥ and cos(α) = u2/∥u∥. Hence the definition of the vector w above; that is why its coordinates are also called direction cosines.
Computing the direction vector
We will now compute the direction of the vector u from Figure 4.
cos(θ) = u1/∥u∥ = 3/5 = 0.6
cos(α) = u2/∥u∥ = 4/5 = 0.8
So the direction vector of u is w(0.6, 0.8).
We can see that w has indeed the same direction as u, except that it is smaller. Something interesting about direction vectors like w is that their norm is equal to 1. That's why we often call them unit vectors.
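A quick NumPy check of the norm and direction cosines computed above:

```python
import numpy as np

u = np.array([3.0, 4.0])
norm = np.linalg.norm(u)          # sqrt(3^2 + 4^2) = 5
w = u / norm                      # unit (direction) vector: (0.6, 0.8)
print(norm, w, np.linalg.norm(w)) # 5.0, [0.6 0.8], 1.0
```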
The sum of two vectors
Given two vectors u(u1, u2) and v(v1, v2), their sum is u + v = (u1+v1, u2+v2), and their difference is u − v = (u1−v1, u2−v2). Since the subtraction is not commutative, we can also consider the other case:
v − u = (v1−u1, v2−u2)
The last two pictures describe the "true" vectors generated by the difference of u and v.
However, since a vector has a magnitude and a direction, we often consider that parallel
translate of a given vector (vectors with the same magnitude and direction but with a
different origin) are the same vector, just drawn in a different place in space.
So don't be surprised if you meet the following:
If you do the math, it looks wrong, because the end of the vector u−v is not in the right
point, but it is a convenient way of thinking about vectors which you'll encounter often.
The dot product
One very important notion to understand SVM is the dot product.
Definition: Geometrically, it is the product of the Euclidean magnitudes of the two vectors and the cosine of the angle between them.
Which means if we have two vectors x and y and there is an angle θ (theta) between them,
their dot product is:
x⋅y=∥x∥∥y∥cos(θ)
Why?
To understand let's look at the problem geometrically.
In the definition, they talk about cos(θ), let's see what it is.
By definition we know that in a right-angled triangle:
cos(θ)=adjacent/hypotenuse
In our example, we don't have a right-angled triangle.
However, if we take a different look at Figure 12, we can find two right-angled triangles formed by each vector with the horizontal axis.
We can view point A as a vector from the origin to A. If we project it onto a normal vector w, we get the length of A along the direction of w; this projection is what is used later when computing the margin.
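A short NumPy sketch of these ideas; the normal vector w used here is an arbitrary, made-up example rather than one produced by an SVM.

```python
import numpy as np

x = np.array([3.0, 4.0])
y = np.array([4.0, 1.0])

# Algebraic dot product: x1*y1 + x2*y2.
algebraic = x @ y

# Geometric definition: ||x|| * ||y|| * cos(theta), theta being the angle between them.
theta = np.arctan2(x[1], x[0]) - np.arctan2(y[1], y[0])
geometric = np.linalg.norm(x) * np.linalg.norm(y) * np.cos(theta)
print(algebraic, geometric)           # both 16.0

# Projecting point A (seen as vector a) onto an assumed normal vector w.
a = np.array([3.0, 4.0])
w = np.array([2.0, 1.0])              # hypothetical normal vector, for illustration only
scalar_projection = (a @ w) / np.linalg.norm(w)
projection_vector = scalar_projection * w / np.linalg.norm(w)
print(scalar_projection, projection_vector)
```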
Theory questions
Logistic regression is one of the most popular Machine Learning algorithms, which
comes under the Supervised Learning technique. It is used for predicting the categorical
dependent variable using a given set of independent variables.
Logistic regression predicts the output of a categorical dependent variable. Therefore the
outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1, true
or False, etc. but instead of giving the exact value as 0 and 1, it gives the probabilistic
values which lie between 0 and 1.
Logistic Regression is much similar to the Linear Regression except that how they are
used. Linear Regression is used for solving Regression problems, whereas Logistic
regression is used for solving the classification problems.
In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic
function, which predicts two maximum values (0 or 1). The curve from the logistic function indicates the likelihood of something, such as whether the cells are cancerous or not, or whether a mouse is obese or not based on its weight, etc.
Logistic Regression is a significant machine learning algorithm because it has the ability
to provide probabilities and classify new data using continuous and discrete datasets.
Logistic Regression can be used to classify the observations using different types of data
and can easily determine the most effective variables used for the classification. The
below image is showing the logistic function:
Note: Logistic regression uses the concept of predictive modeling as regression; therefore, it
is called logistic regression, but is used to classify samples; Therefore, it falls under the
classification algorithm.
The equation of the straight line can be written as y = b0 + b1x1 + b2x2 + ... + bnxn. In logistic regression y can be between 0 and 1 only, so let's divide the above equation by (1 − y): y/(1 − y), which is 0 for y = 0 and infinity for y = 1. But we need a range between −infinity and +infinity, so taking the logarithm of the equation it becomes:
log[y/(1 − y)] = b0 + b1x1 + b2x2 + ... + bnxn
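A minimal numeric sketch of this log-odds/sigmoid relationship; the coefficients b0, b1 below are made-up values for illustration.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

b0, b1 = -1.0, 2.0            # assumed coefficients for illustration
x = np.array([-2.0, 0.0, 0.5, 3.0])

log_odds = b0 + b1 * x        # log(y / (1 - y)), ranges over (-inf, +inf)
y = sigmoid(log_odds)         # predicted probabilities in (0, 1)

print(y)
print(np.log(y / (1 - y)))    # recovers the linear log-odds
```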
On the basis of the categories, Logistic Regression can be classified into three types:
Binomial: In binomial Logistic regression, there can be only two possible types of the dependent variable, such as 0 or 1, Pass or Fail, etc.
Multinomial: In multinomial Logistic regression, there can be three or more possible unordered types of the dependent variable, such as "cat", "dog" or "sheep".
Ordinal: In ordinal Logistic regression, there can be three or more possible ordered types of the dependent variable, such as "low", "medium" or "high".
Advantages
Logistic regression is easier to implement and interpret, and very efficient to train.
It makes no assumptions about the distributions of classes in feature space.
It can easily extend to multiple classes (multinomial regression) and gives a natural probabilistic view of class predictions.
It not only provides a measure of how appropriate a predictor is (coefficient size), but also its direction of association (positive or negative).
It is very fast at classifying unknown records.
It gives good accuracy for many simple data sets and performs well when the dataset is linearly separable.
Model coefficients can be interpreted as indicators of feature importance.
Logistic regression is less inclined to overfitting, but it can overfit in high-dimensional datasets; one may consider regularization (L1 and L2) techniques to avoid overfitting in these scenarios.
Disadvantages
If the number of observations is less than the number of features, logistic regression should not be used; otherwise, it may lead to overfitting.
It constructs linear boundaries.
A limitation of logistic regression is the assumption of linearity between the dependent variable and the independent variables.
It can only be used to predict discrete functions, so the dependent variable of logistic regression is bound to the discrete number set.
Non-linear problems cannot be solved with logistic regression because it has a linear decision surface; linearly separable data is rarely found in real-world scenarios.
Logistic regression requires little or no multicollinearity between the independent variables.
It is tough to capture complex relationships using logistic regression; more powerful and compact algorithms such as neural networks can easily outperform it.
While in linear regression the independent and dependent variables are related linearly, logistic regression requires that the independent variables be linearly related to the log odds, log(p/(1-p)).
Difference between Linear Regression and Logistic Regression:
In linear regression the dependent variable should be numeric and the response variable is continuous in value, whereas in logistic regression the dependent variable consists of only two categories; logistic regression estimates the odds of the outcome of the dependent variable given a set of quantitative or categorical independent variables.
Linear regression is based on least squares estimation, whereas logistic regression is based on maximum likelihood estimation.
In linear regression, when we plot the training data, a straight line can be drawn that touches the maximum number of points. In logistic regression, any change in a coefficient leads to a change in both the direction and the steepness of the logistic function: positive slopes result in an S-shaped curve and negative slopes result in a Z-shaped curve.
Linear regression is used to estimate the dependent variable in case of a change in the independent variables, for example predicting the price of houses, whereas logistic regression is used to calculate the probability of an event, for example classifying whether tissue is benign or malignant.
Linear regression assumes a normal (Gaussian) distribution of the dependent variable, whereas logistic regression assumes a binomial distribution of the dependent variable.
The i indexes have been removed for clarity. In words, this is the cost the algorithm pays if it predicts a value hθ(x) while the actual class label turns out to be y. By using this function we grant convexity to the function the gradient descent algorithm has to process, as discussed above. There is also a mathematical proof for that, which is outside the scope of this introductory course. In case y = 1, the output (i.e. the cost to pay) approaches 0 as hθ(x) approaches 1. Conversely, the cost to pay grows to infinity as hθ(x) approaches 0. This is a desirable property: we want a bigger penalty as the algorithm predicts something far away from the actual value. If the label is y = 1 but the algorithm predicts hθ(x) = 0, the outcome is completely wrong. The same intuition applies when y = 0 (depicted in plot 2 below, right side): there are bigger penalties when the label is y = 0 but the algorithm predicts hθ(x) = 1. The above two functions can be compressed into a single function, i.e.
Cost(hθ(x), y) = −y log(hθ(x)) − (1 − y) log(1 − hθ(x))
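A minimal sketch of this compressed cost for a single training example, showing the penalties it assigns:

```python
import numpy as np

def logistic_cost(h, y):
    """Cost = -y*log(h) - (1-y)*log(1-h), with h the predicted probability."""
    return -y * np.log(h) - (1 - y) * np.log(1 - h)

for h in [0.99, 0.5, 0.01]:
    print(f"h={h:4}  cost if y=1: {logistic_cost(h, 1):6.3f}   "
          f"cost if y=0: {logistic_cost(h, 0):6.3f}")
# Predictions close to the true label cost ~0; confidently wrong predictions cost a lot.
```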
Theory questions
Suppose there are two categories, i.e., Category A and Category B, and we have a new data
point x1, so this data point will lie in which of these categories. To solve this type of
problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the
category or class of a particular dataset. Consider the below diagram:
The K-NN working can be explained on the basis of the below algorithm:
Step-1: Select the number K of the neighbors
Step-2: Calculate the Euclidean distance of K number of neighbors
Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
Step-4: Among these K neighbors, count the number of data points in each category.
Step-5: Assign the new data point to the category for which the number of neighbors is maximum.
Step-6: Our model is ready.
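A minimal scikit-learn sketch of these steps with K = 5 (the dataset is illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)                     # Steps 1-2: store data; distances are computed at query time

print(knn.predict(X_test[:5]))                # Steps 3-5: majority vote among the 5 nearest points
print("Accuracy:", knn.score(X_test, y_test))
```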
Suppose we have a new data point and we need to put it in the required category. Consider
the below image:
Firstly, we will choose the number of neighbors, so we will choose the k=5.
Next, we will calculate the Euclidean distance between the data points. The Euclidean
distance is the distance between two points, which we have already studied in geometry.
It can be calculated as:
By calculating the Euclidean distance we got the nearest neighbors, as three nearest
neighbors in category A and two nearest neighbors in category B. Consider the below
image:
As we can see the 3 nearest neighbors are from category A, hence this new data point
must belong to category A.
K-Means Clustering is an unsupervised learning algorithm that is used to solve the clustering
problems in machine learning or data science. In this topic, we will learn what is K-means
clustering algorithm, how the algorithm works, along with the Python implementation of k-
means clustering.
K-Means Clustering is an Unsupervised Learning algorithm, which groups the unlabeled
dataset into different clusters. Here K defines the number of pre-defined clusters that need
to be created in the process, as if K=2, there will be two clusters, and for K=3, there will be
three clusters, and so on.
It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each data point belongs to only one group, whose members share similar properties.
It allows us to cluster the data into different groups and a convenient way to discover the
categories of groups in the unlabeled dataset on its own without the need for any training.
It is a centroid-based algorithm, where each cluster is associated with a centroid. The main
aim of this algorithm is to minimize the sum of distances between the data point and their
corresponding clusters.
The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and repeats the process until no further improvement in the clusters is found. The value of k should be predetermined in this algorithm.
The k-means clustering algorithm mainly performs two tasks:
Determines the best value for K center points or centroids by an iterative process.
Assigns each data point to its closest k-center. Those data points which are near to the
particular k-center, create a cluster.
Hence each cluster has datapoints with some commonalities, and it is away from other
clusters. The below diagram explains the working of the K-means Clustering Algorithm:
Let's take the number of clusters K = 2, to identify the dataset and to put the points into different clusters. It means that here we will try to group the dataset into 2 different clusters.
We need to choose some random k points or centroid to form the cluster. These points
can be either the points from the dataset or any other point. So, here we are selecting
the below two points as k points, which are not the part of our dataset. Consider the
below image:
Now we will assign each data point of the scatter plot to its closest K-point or centroid.
We will compute it by applying some mathematics that we have studied to calculate the
distance between two points. So, we will draw a median between both the centroids.
Consider the below image:
From the above image, it is clear that points on the left side of the line are near the K1 or blue centroid, and points to the right of the line are close to the yellow centroid. Let's color them blue and yellow for clear visualization.
As we need to find the closest cluster, so we will repeat the process by choosing a new
centroid. To choose the new centroids, we will compute the center of gravity of these
centroids, and will find new centroids as below:
Next, we will reassign each datapoint to the new centroid. For this, we will repeat the same process of finding a median line. The median will be like the below image:
From the above image, we can see, one yellow point is on the left side of the line, and two
blue points are right to the line. So, these three points will be assigned to new centroids.
As reassignment has taken place, so we will again go to the step-4, which is finding new
centroids or K-points.
We will repeat the process by finding the center of gravity of centroids, so the new
centroids will be as shown in the below image:
As we got the new centroids so again will draw the median line and reassign the data
points. So, the image will be:
We can see in the above image; there are no dissimilar data points on either side of the
line, which means our model is formed. Consider the below image:
As our model is ready, so we can now remove the assumed centroids, and the two final
clusters will be as shown in the below image:
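Beyond the pictures, the whole procedure is a few lines with scikit-learn; the sketch below uses synthetic blob data and K = 2.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=2, random_state=0)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)          # iterative assignment / centroid updates happen inside fit

print(kmeans.cluster_centers_)          # final centroids after convergence
print(labels[:10])                      # cluster assignment of the first few points
```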
The k-nearest neighbors (KNN) algorithm is a simple, supervised machine learning algorithm that can be used to solve both classification and regression problems. It is easy to implement and understand, but has a major drawback of becoming significantly slower as the size of the data in use grows. KNN works by finding the distances between a query and all the examples in the data, selecting the specified number of examples (K) closest to the query, then voting for the most frequent label (in the case of classification) or averaging the labels (in the case of regression). In both classification and regression, we saw that choosing the right K for our data is done by trying several values of K and picking the one that works best.
Advantages
The algorithm is simple and easy to implement.
There's no need to build a model, tune several parameters, or make additional assumptions.
The algorithm is versatile. It can be used for classification, regression, and search.
Disadvantages
The algorithm gets significantly slower as the number of examples and/or
predictors/independent variables increase.
To select the K that's right for your data, we run the KNN algorithm several times with different values of K and choose the K that reduces the number of errors we encounter while maintaining the algorithm's ability to accurately make predictions when it's given data it hasn't seen before. Here are some things to keep in mind:
As we decrease the value of K to 1, our predictions become less stable. Imagine K=1 and a query point surrounded by several red points and one green point, where the green point happens to be the single nearest neighbor. Reasonably, we would think the query point is most likely red, but because K=1, KNN incorrectly predicts that the query point is green.
Inversely, as we increase the value of K, our predictions become more stable due to majority voting / averaging, and thus more likely to be accurate (up to a certain point). Eventually, we begin to witness an increasing number of errors. It is at this point that we know we have pushed the value of K too far.
In cases where we are taking a majority vote (e.g. picking the mode in a classification problem) among labels, we usually make K an odd number to have a tiebreaker.
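As a rough illustration of this trial-and-error selection of K, the sketch below (assuming scikit-learn, and using the Iris dataset purely as stand-in data) trains KNN for several odd values of K and compares test accuracy.

```python
# Sketch of picking K for KNN by trying several values (scikit-learn assumed)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Odd values of K help avoid ties in the majority vote
for k in [1, 3, 5, 7, 9, 11]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(f"K={k:2d}  test accuracy={knn.score(X_test, y_test):.3f}")
```

Whichever K gives the best held-out accuracy (without the instability of K=1) would be the one to keep.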
49. How to choose the value of "K number of clusters" in K-means Clustering?
The performance of the K-means clustering algorithm depends upon the highly efficient clusters that it forms. But choosing the optimal number of clusters is a big task. There are different ways to find the optimal number of clusters; here we discuss the most appropriate method to find the number of clusters or the value of K. The method is given below:
Elbow Method
The Elbow method is one of the most popular ways to find the optimal number of clusters.
This method uses the concept of WCSS value. WCSS stands for Within Cluster Sum of
Squares, which defines the total variations within a cluster. The formula to calculate the
value of WCSS (for 3 clusters) is given below:
WCSS = Σ(Pi in Cluster1) distance(Pi, C1)² + Σ(Pi in Cluster2) distance(Pi, C2)² + Σ(Pi in Cluster3) distance(Pi, C3)²
In the above formula of WCSS,
Σ(Pi in Cluster1) distance(Pi, C1)² is the sum of the squared distances between each data point and its centroid within Cluster 1, and the same holds for the other two terms.
To measure the distance between data points and centroid, we can use any method such as
Euclidean distance or Manhattan distance.
To find the optimal value of clusters, the elbow method follows the below steps:
It executes K-means clustering on a given dataset for different K values (typically ranging from 1 to 10).
For each value of K, it calculates the WCSS value.
It plots a curve between the calculated WCSS values and the number of clusters K.
The sharp point of bend, where the plot looks like an arm or elbow, is considered the best value of K.
Since the graph shows a sharp bend that looks like an elbow, the method is known as the elbow method. The graph for the elbow method looks like the below image:
Note: We can choose the number of clusters equal to the number of data points. If we choose the number of clusters equal to the data points, then the value of WCSS becomes zero, and that will be the endpoint of the plot.
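The elbow procedure described above can be sketched as follows. This assumes scikit-learn and matplotlib, uses synthetic blob data in place of a real dataset, and relies on the fitted model's inertia_ attribute, which is the WCSS value.

```python
# Elbow-method sketch: WCSS (inertia_ in scikit-learn) for K = 1..10
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)  # toy data with 4 true clusters

wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)  # within-cluster sum of squares for this K

plt.plot(range(1, 11), wcss, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("WCSS")
plt.title("Elbow method")
plt.show()
```

On the resulting curve, the K at the "elbow" (here it should appear near K = 4) is taken as the best number of clusters.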
*********************
Theory questions
approach?
Regression
If the predicted value is a continuous value, the problem falls under the Regression type in machine learning. Example: given area name, size of land, etc. as features, predicting the expected cost of the land.
Classification
If the predicted value is a category/discrete value like yes/no, positive/negative, etc., the problem falls under the Classification type in machine learning. Example: given a sentence, predicting whether it is a negative or positive review.
Clustering
Grouping a set of points into a given number of clusters. Example: given 3, 4, 8, 9 and the number of clusters as 2, the ML system might divide the given set into cluster 1 = {3, 4} and cluster 2 = {8, 9}.
Ranking
Used for constructing a ranker from a set of labelled examples. This example set
consists of instance groups that can be scored with a given criteria. The ranking labels
are { 0, 1, 2, 3, 4 } for each instance. The ranker is trained to rank new instance groups
with unknown scores for each instance.
pictorially.
In clustering, the groups are defined by the "distance" among sample cases, and at the end the researcher looks for some meaning in that grouping. Regression, classification, and clustering are based on the sample content, and the result reflects that sample.
To extrapolate the conclusion to the population requires validation according to the level of confidence you are willing to assume, and additional mathematical processing needs to be done.
Regression vs. Classification
In Regression, the output variable must be of continuous nature or real value. In Classification, the output variable must be a discrete value.
The task of the regression algorithm is to map the input value (x) with the continuous output variable (y). The task of the classification algorithm is to map the input value (x) with the discrete output variable (y).
Regression algorithms are used with continuous data. Classification algorithms are used with discrete data.
In Regression, we try to find the best-fit line, which can predict the output more accurately. In Classification, we try to find the decision boundary, which can divide the dataset into different classes.
model.
Regression Analysis is an analytical process whose end goal is to understand the inter-
relationships in the data and find as much useful information as possible.
According to the book, there are a number of steps which are loosely detailed below.
1 - Problem definition
The very first step is, of course, to define the problem we are trying to solve. Perhaps a business question that needs to be answered or simply a prediction we want to make based on some set of data. In this stage we must know the target variable and the attributes we presume affect the target variable; this would later be analysed to judge its credibility. For the sake of our discussion, let's take the Titanic dataset as an example. In this dataset we have data on about 900 passengers. The question or the problem we must solve is predicting which passengers likely survived the tragedy, given their data.
2 - Analyse Data
The key is to have visual representations of our data so we can better understand the 'inter-relationships' of the variables; the book referred to earlier highly recommends using visual tools to make the EDA (Exploratory Data Analysis) process easier. For the afore-mentioned dataset, we could try answering a number of things that might give us a better understanding of the problem at hand. What's the survival rate of passengers from each class?
minimizes the sum of the squares of the errors, given that no outliers are present in the data.
5 - Model evaluation
The final step is model evaluation: measuring and critiquing exactly how well the model fits the data points. We run the model on the test data and check how accurately it was able to predict the output values. There are a number of measures to check this, as discussed below:
i) We can find the RMSE (root mean squared error) of the actual Y values and predicted Y values. There are other variations of it that can be explored.
There are many other methods, some more complex than others but these are usually a
good place to start. Based on this analysis, the model is updated and perfected after which it
can be used for its intended purpose.
9. What are the sources of the data that is needed for training the
milling/drilling/lathe?
10. What are the sources of the data that is needed for training the
elements?
11. What are the sources of the data that is needed for training the ML model
12. What are the sources of the data that is needed for training the ML model
13. You have given a task of developing a classification model for identifying
data corresponds to tool state either as healthy or faulty. So what are the
14. What is training data? What is labeled data? What is unlabeled data? What
Machine learning models are as good as the data they're trained on. Without
high-quality training data, even the most efficient machine learning algorithms
will fail to perform.
The need for quality, accurate, complete, and relevant data starts early on in the
training process.
Only if the algorithm is fed with good training data can it easily pick up the
features and find relationships that it needs to predict down the line.
More precisely, quality training data is more significant to machine learning (and artificial intelligence) than any other aspect.
If you introduce the machine learning (ML) algorithms to the right data, you're
setting them up for accuracy and success.
Training data is the initial dataset used to train machine learning algorithms.
Models create and refine their rules using this data. It's a set of data samples used to fit the parameters of a machine learning model, training it by example.
Training data is also known as training dataset, learning set, and training set. It's
an essential component of every machine learning model and helps them make
accurate predictions or perform a desired task.
Simply put, training data builds the machine learning model. It teaches what the
expected output looks like. The model analyzes the dataset repeatedly to deeply
understand its characteristics and adjust itself for better performance.
In a broader sense, training data can be classified into two categories: labeled
data and unlabeled data.
Labeled data is a group of data samples tagged with one or more meaningful
labels. It's also called annotated data, and its labels identify specific
characteristics, properties, classifications, or contained objects.
For example, the images of fruits can be tagged as apples, bananas, or grapes.
Labeled training data is used in supervised learning. It enables ML models to
learn the characteristics associated with specific labels, which can be used to
classify newer data points. In the example above, this means that a model can use
labeled image data to understand the features of specific fruits and use this
information to group new images.
Data labeling or annotation is a time-consuming process as humans need to tag
or label the data points. Labeled data collection is challenging and expensive. It
isn't easy to store labeled data when compared to unlabeled data.
As expected, unlabeled data is the opposite of labeled data. It's raw data or data
that's not tagged with any labels for identifying classifications, characteristics, or
properties. It's used in unsupervised machine learning, and the ML models have
to find patterns or similarities in the data to reach conclusions.
Going back to the previous example of apples, bananas, and grapes, in unlabeled
training data, the images of those fruits won't be labeled. The model will have to
evaluate each image by looking at its characteristics, such as color and shape.
After analyzing a considerable number of images, the model will be able to
differentiate new images (new data) into the fruit types of apples, bananas, or
grapes. Of course, the model wouldn't know that the particular fruit is called an
apple. Instead, it knows the characteristics needed to identify it.
There are hybrid models that use a combination of supervised and unsupervised
machine learning.
For machine learning models, historical data is fodder. Just as humans rely on past
experiences to make better decisions, ML models look at their training dataset with past
observations to make predictions.
Predictions could include classifying images as in the case of image recognition, or
understanding the context of a sentence as in natural language processing (NLP).
Think of a data scientist as a teacher, the machine learning algorithm as the student, and
the training dataset as the collection of all textbooks.
The teacher's aspiration is that the student must perform well in exams and also in the real world. In the case of ML algorithms, testing is like exams. The textbooks (training dataset) contain several examples of the type of questions that'll be asked in the exam. Of course, they won't contain all the examples of questions that'll be asked in the exam, nor will all the examples included in the textbook be asked in the exam. The textbooks can help prepare the student by teaching them what to expect and how to respond.
No textbook can ever be fully complete. As time passes, the kind of questions asked will change, and so the information included in the textbooks needs to be changed. In the case of ML algorithms, the training set should be periodically updated to include new information.
In short, training data is a textbook that helps data scientists give ML algorithms an idea of what to expect. Although the training dataset doesn't contain all possible examples, it'll make algorithms capable of making predictions.
High-quality data translates to accurate machine learning models. Low-quality data can significantly affect the accuracy of models, which can lead to severe financial losses. It's almost like giving a student a textbook containing wrong information and expecting them to excel in the examination. The following are the four primary traits of quality training data.
Relevant
The data needs to be relevant to the task at hand. For example, if you want to train
a computer vision algorithm for autonomous vehicles, you probably won't require images of
fruits and vegetables. Instead, you would need a training dataset containing photos of roads,
sidewalks, pedestrians, and vehicles.
Representative
The AI training data must have the data points or features that the application is made to
predict or classify. Of course, the dataset can never be absolute, but it must have at least the
attributes the AI application is meant to recognize. For example, if the model is meant to
recognize faces within images, it must be fed with diverse data containing people's faces
from various ethnicities. This will reduce the problem of AI bias, and the model won't be
prejudiced against a particular race, gender, or age group.
Uniform
All data should have the same attribute and must come from the same source. Suppose your
machine learning project aims to predict churn rate by looking at customer information. For
that, you'll have a customer information database that includes customer name, address,
number of orders, order frequency, and other relevant information. This is historical data and
can be used as training data. One part of the data can't have additional information, such as
age or gender. This will make training data incomplete and the model inaccurate. In short,
uniformity is a critical aspect of quality training data.
Comprehensive
Again, the training data can never be absolute. But it should be a large dataset that
represents the majority of the model's use cases. The training data must have enough examples that'll allow the model to learn appropriately. It must contain real-world data
samples as it will help train the model to understand what to expect. If you're thinking of
training data as values placed in large numbers of rows and columns, sorry, you're wrong. It
could be any data type like text, images, audio, or videos.
Humans are highly social creatures, but there are some prejudices that we might have picked up as children and that require constant conscious effort to get rid of. Although unfavourable, such biases may affect our creations, and machine learning applications are no different. For ML models, training data is the only book they read. Their performance or accuracy will depend on how comprehensive, relevant, and representative that very book is. That being said, three factors affect the quality of training data:
People: The people who train the model have a significant impact on its accuracy or performance. If they're biased, it'll naturally affect how they tag data and, ultimately, how the ML model functions.
Processes: The data labeling process must have tight quality control checks in place.
This will significantly increase the quality of training data.
Tools: Incompatible or outdated tools can make data quality suffer. Using robust data
labeling software can reduce the cost and time associated with the process.
There isn't a specific answer to how much training data is enough training data. It
depends on the algorithm you're training – its expected outcome, application,
complexity, and many other factors.
Suppose you want to train a text classifier that categorizes sentences based on the
occurrence of the terms "cat" and "dog" and their synonyms such as "kitty," "kitten,"
"pussycat," "puppy," or "doggy". This might not require a large dataset as there are only
a few terms to match and sort.
But, if this was an image classifier that categorized images as "cats" and "dogs," the
number of data points needed in the training dataset would shoot up significantly. In
short, many factors come into play to decide what training data is enough training data.
The amount of data required will change depending on the algorithm used.
For context, deep learning, a subset of machine learning, requires millions of data points to train artificial neural networks (ANNs). In contrast, traditional machine learning algorithms may require only thousands of data points. But of course, this is a broad generalization, as the amount of data needed varies depending on the application.
The more you train the model, the more accurate it becomes. So it's always better to
have a large amount of data as training data.
Garbage in, garbage out
The phrase "garbage in, garbage out" is one of the oldest and most used phrases in data
science. Even with the rate of data generation growing exponentially, it still holds true.
The key is to feed high-quality, representative data to machine learning algorithms.
Doing so can significantly enhance the accuracy of models. Good quality training data is
also crucial for creating unbiased machine learning applications.
19. How should you split up a dataset into test and training sets?
When we are working on model development, we need to train the model and then test it. Since it is challenging to possess a vast amount of data while the model is in the development phase, the most obvious answer is to split the available data into two separate sets, out of which one will be for training and the other will be for testing.
A. Splitting the data set into a training set and test set:
The two conditions that need to be taken care of before proceeding with the splitting of the
dataset:
The test set needs to be large enough to give statistically meaningful results.
The characteristics of the training set and test set should be similar.
Therefore, after the satisfaction of the above two conditions, the ultimate goal should be to
develop a model that can easily perform functions with the new dataset.
B. Validation of the trained model over the test data.
The model should not be trained on the test data. Many times, unexpectedly good results on evaluation metrics are an indication that you are inadvertently training on test data.
A test set in machine learning is a secondary (or tertiary) data set that is used to test a
machine learning program after it has been trained on an initial training data set. The
idea is that predictive models always have some sort of unknown capacity that needs to
be tested out, as opposed to analysed from a programming perspective.
A test set is also known as a test data set or test data.
In machine learning, a validation set is used to “tune the parameters” of a classifier. The
validation test evaluates the program's capability according to the variation of
parameters to see how it might function in successive testing.
The validation set is also known as a validation data set, development set or dev set.
22. Compare Training data vs. test data vs. validation data.
Training data is used in model training, or in other words, it's the data used to fit the
model. On the contrary, test data is used to evaluate the performance or accuracy of the
model. It's a sample of data used to make an unbiased evaluation of the final model fit
on the training data.
A training dataset is an initial dataset that teaches the ML models to identify desired
patterns or perform a particular task. A testing dataset is used to evaluate how effective
the training was or how accurate the model is.
Once an ML algorithm is trained on a particular dataset and if you test it on the same
dataset, it's more likely to have high accuracy because the model knows what to expect.
If the training dataset contains all possible values the model might encounter in the
future, all well and good.
But that's never the case. A training dataset can never be comprehensive and can't teach
everything that a model might encounter in the real world. Therefore a test dataset,
containing unseen data points, is used to evaluate the model's accuracy.
Then there's validation data. This is a dataset used for frequent evaluation during the
training phase. Although the model sees this dataset occasionally, it doesn't learn from
it. The validation set is also referred to as the development set or dev set. It helps protect
models from overfitting and underfitting.
Although validation data is separate from training data, data scientists might reserve a
part of the training data for validation. But of course, this automatically means that the
validation data was kept away during the training.
Many use the terms "test data" and "validation data" interchangeably. The main
difference between the two is that validation data is used to validate the model during
the training, while the testing set is used to test the model after the training is
completed.
The validation dataset gives the model the first taste of unseen data. However, not all
data scientists perform an initial check using validation data. They might skip this part
and go directly to testing data.
The classifier model can be designed/trained and its performance evaluated in K-fold cross-validation mode, training mode, or test mode.
The main idea behind K-fold cross-validation is that each sample in our dataset has the opportunity of being tested. It is a special case of cross-validation where we iterate over a dataset k times. In each round, we split the dataset into k parts: one part is used for validation, and the remaining k-1 parts are merged into a training subset used to fit the model.
Computation time is reduced, as we repeat the process only 10 times when the value of k is 10. It has reduced bias.
Every data point gets to be tested exactly once and is used in training k-1 times.
The variance of the resulting estimate is reduced as k increases.
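A minimal sketch of 10-fold cross-validation follows, assuming scikit-learn and using the Iris dataset with a decision tree purely as placeholders for the classifier being evaluated.

```python
# 10-fold cross-validation sketch (scikit-learn assumed; dataset and classifier are placeholders)
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

kfold = KFold(n_splits=10, shuffle=True, random_state=1)   # each sample is tested exactly once
scores = cross_val_score(DecisionTreeClassifier(random_state=1), X, y, cv=kfold)

print("Fold accuracies:", scores.round(3))
print("Mean accuracy  :", scores.mean().round(3))
```

Averaging the 10 fold accuracies gives a less biased estimate of performance than a single train/test split.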
• Machine learning algorithms have hyperparameters that allow you to tailor the
behavior of the algorithm to your specific dataset.
• Hyperparameters are different from parameters, which are the internal coefficients or
weights for a model found by the learning algorithm. Unlike parameters,
hyperparameters are specified by the practitioner when configuring the model.
• Typically, it is challenging to know what values to use for the hyperparameters of a given
algorithm on a given dataset, therefore it is common to use random or grid search
strategies for different hyperparameter values.
• The more hyperparameters of an algorithm that you need to tune, the slower the
tuning process. Therefore, it is desirable to select a minimum subset of model
hyperparameters to search or tune.
Max_Depth: The maximum depth of the tree. If this is not specified in the Decision Tree, the
nodes will be expanded until all leaf nodes are pure or until all leaf nodes contain less than
min_samples_split.
Default = None
Input options → integer
Min_Samples_Split: The minimum number of samples required to split an internal node. If the number of samples in an internal node is less than min_samples_split, then that node will become a leaf node.
Default = 2
Input options → integer or float (if float, then min_samples_split is fraction)
Min_Samples_Leaf: The minimum samples required to be at a leaf node. Therefore, a split
can only happen if it leaves at least the min_samples_leaf in both of the resulting nodes.
Default = 1
Input options → integer or float (if float, then min_samples_leaf is fraction)
Max_Features: The number of features to consider when looking for the best split. For
example, if there are 35 features in a dataframe and max_features is 9, only 9 of the 35
features will be used in the decision tree.
Default = None
Input options → integer, float (if float, then max_features is fraction) or {“auto”, “sqrt”,
“log2”}
“auto”: max_features=sqrt(n_features)
“sqrt”: max_features = sqrt(n_features)
“log2”: max_features=log2(n_features)
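These hyperparameters map directly onto scikit-learn's DecisionTreeClassifier. The sketch below is illustrative only; the specific values are arbitrary choices, not recommendations.

```python
# Sketch of setting the decision-tree hyperparameters discussed above (scikit-learn assumed)
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(
    max_depth=4,           # default None: otherwise grow until leaves are pure
    min_samples_split=10,  # default 2: a node with fewer samples becomes a leaf
    min_samples_leaf=5,    # default 1: each split must leave at least this many samples per leaf
    max_features="sqrt",   # default None: otherwise consider all features at each split
    random_state=0,
)
tree.fit(X, y)
print("Tree depth:", tree.get_depth(), " Leaves:", tree.get_n_leaves())
```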
The choice of kernel controls the manner in which the input variables are projected. There are many to choose from, but linear, polynomial, and RBF are the most common, perhaps just linear and RBF in practice.
kernels in [‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’]
If the polynomial kernel works out, then it is a good idea to dive into the degree
hyperparameter.
Another critical parameter is the penalty (C) that can take on a range of values and has a
dramatic effect on the shape of the resulting regions for each class. A log scale might be
a good starting point.
C in [100, 10, 1.0, 0.1, 0.001]
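A grid-search sketch over the kernel and C values listed above is given below; it assumes scikit-learn, and the dataset is only a placeholder.

```python
# Grid search over the SVM kernel and penalty C (values taken from the lists above)
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

param_grid = {
    "kernel": ["linear", "poly", "rbf", "sigmoid"],
    "C": [100, 10, 1.0, 0.1, 0.001],
}
search = GridSearchCV(SVC(), param_grid, cv=5)   # 5-fold CV for each combination
search.fit(X, y)

print("Best params  :", search.best_params_)
print("Best CV score: %.3f" % search.best_score_)
```

If the polynomial kernel wins, a follow-up search over its degree hyperparameter would be the natural next step.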
Number of neurons, weights, and biases: A weight is the amplification of the input signals to a neuron, and a bias is an additive term for a neuron.
Activation function: Defines how a neuron or group of neurons activate ("spiking")
based on input connections and bias term(s).
Learning rate: Step length for gradient descent update
Batch size: Number of training examples in each gradient descent (gd) update.
Epochs: The number of times all training examples have been passed through the
network during training.
Loss function: Loss function specifies how to calculate the error between prediction and
label for a given training example. The error is backpropagated during training in order
to update learnable parameters.
Number of layers: Typically layers between input and output layer, which are called
hidden layers
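Several of these hyperparameters can be seen in scikit-learn's MLPClassifier, shown below as a rough sketch. The layer sizes and values are illustrative assumptions, and note that MLPClassifier fixes its own loss function, so not every item above is configurable here.

```python
# Sketch mapping the neural-network hyperparameters above onto scikit-learn's MLPClassifier
from sklearn.datasets import load_iris
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)

mlp = MLPClassifier(
    hidden_layer_sizes=(16, 8),   # number of hidden layers and neurons per layer
    activation="relu",            # activation function
    learning_rate_init=0.01,      # learning rate (step length of the gradient update)
    batch_size=32,                # training examples per gradient update
    max_iter=200,                 # upper bound on training epochs
    random_state=0,
)
mlp.fit(X, y)
print("Training accuracy: %.3f" % mlp.score(X, y))
```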
CLASSIFICATION MODEL:
Confusion matrix: A Confusion matrix is an N x N matrix used for evaluating the
performance of a classification model, where N is the number of target classes. The
matrix compares the actual target values with those predicted by the machine
learning model
What can we learn from this matrix?
• There are two possible predicted classes: "yes" and "no". If we were predicting the
presence of a disease, for example, "yes" would mean they have the disease, and "no"
would mean they don't have the disease.
• The classifier made a total of 165 predictions (e.g., 165 patients were being tested for
the presence of that disease).
• Out of those 165 cases, the classifier predicted "yes" 110 times, and "no" 55 times.
• In reality, 105 patients in the sample have the disease, and 60 patients do not.
True positives (TP): these are cases in which we predicted yes (they have the disease),
and they do have the disease.
True negatives (TN): we predicted no, and they don't have the disease.
False positives (FP): we predicted yes, but they don't actually have the disease. (Also known as a "type I error.")
False negatives (FN): we predicted no, but they actually do have the disease. (Also known as a "type II error.")
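The sketch below builds a confusion matrix of this kind with scikit-learn. The individual cell counts are illustrative choices consistent with the totals quoted above (165 cases, 110 predicted "yes", 105 actual "yes"), since the original matrix figure is not reproduced here.

```python
# Confusion-matrix sketch (scikit-learn assumed; cell counts are illustrative)
from sklearn.metrics import confusion_matrix

# 1 = "yes" (has the disease), 0 = "no"
y_true = [1] * 100 + [0] * 10 + [1] * 5 + [0] * 50   # 105 actual yes, 60 actual no
y_pred = [1] * 100 + [1] * 10 + [0] * 5 + [0] * 50   # 110 predicted yes, 55 predicted no

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}  TN={tn}  FP={fp}  FN={fn}")          # TP=100, TN=50, FP=10, FN=5
```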
REGRESSION MODEL:
Mean Absolute Error (MAE)
The mean absolute error (MAE) is defined as
MAE = (1/N) Σ |yᵢ - ŷᵢ|
where yᵢ is the actual value, ŷᵢ is the predicted value, and |yᵢ - ŷᵢ| is the absolute value of the difference between the actual and predicted value. N is the number of sample points.
Let's dig into this a bit deeper to understand what this calculation represents.
Take a look at the following plot, which shows the number of failures for a piece of
machinery against the age of the machine:
In order to predict the number of failures from the age, we would want to fit a regression
line such as this:
In order to understand how well this line represents the actual data, we need to measure
how good a fit it is. We can do this by measuring the distance from the actual data points to
the line:
You may recall that these distances are called residuals or errors. The mean size of these
errors is the MAE. We can calculate it as follows:
The mean of the absolute errors (MAE) is 8.5. Why do we take the absolute value? To
remove the sign on the error value! If we don't, the positive and negative errors will tend to
cancel each other out, giving a misleadingly small value for our evaluation metric. If
mathematical symbols are not your strong point, you may not immediately see how this calculation relates to the formula given above.
Mean Absolute Error (MAE) tells us the average error in units of y, the predicted feature. A
value of 0 indicates a perfect fit, i.e. all our predictions are spot on.
The MAE has a big advantage in that the units of the MAE are the same as the units of y,
the feature we want to predict. In the example above, we have an MAE of 8.5, so it means
that on average our predictions of the number of machine failures are incorrect by 8.5
machine failures. This makes MAE very intuitive and the results are easily conveyed to a non-
machine learning expert!
Root Mean Square Error (RMSE)
Another evaluation metric for regression is the root mean square error (RMSE). Its
calculation is very similar to MAE, but instead of taking the absolute value to get rid of the
sign on the individual errors, we square the error (because the square of a negative number
is positive). The formula for RMSE is:
RMSE = sqrt( (1/N) Σ (yᵢ - ŷᵢ)² )
As with MAE, we can think of RMSE as being measured in the y units. So the above error
can be read as an error of 9.9 machine failures on average per observation.
MAE vs. RMSE
Compared to MAE, RMSE gives a higher total error and the gap increases as the errors
become larger. It penalizes a few large errors more than a lot of small errors. If you want
your model to avoid large errors, use RMSE over MAE.
Root Mean Square Error (RMSE) indicates the average error in units of y, the predicted
feature, but penalizes larger errors more severely than MAE. A value of 0 indicates a perfect
fit. You should also be aware that RMSE accumulates slightly higher values than MAE, so the gap between these two measures tends to widen as the sample size increases.
R-Squared
As stated above that an advantage of both MAE and RMSE is that they can be thought of as
errors in the units of y, the predicted feature. This is helpful when relaying the results to non-
data scientists.
We can say things like "our model can predict the reliability of our machinery to within 8.5
machine failures on average" or "our model can predict the selling price of a house to within
£15k on average".
But take heed! This advantage can also be considered a disadvantage! It says nothing about
whether an error of 8.5 machine failures or an error of £15k on a house price is good or bad.
We can't compare how good different models are for different scenarios. This is where R-squared or R2 comes in. Here is the formula for R2:
R² = (var(mean) - var(line)) / var(mean) = 1 - var(line)/var(mean)
R2 computes how much better the regression line fits the data than the mean line.
Another way to look at this formula is to compare the variance around the mean line to the
variation around the regression line:
Take our example above, predicting the number of machine failures. We can examine the
errors for our regression line as we did before. We can also compute a mean line (by taking the mean y value) and examine the errors against this mean line. That is to say, we can see
the errors we would get if our model just predicted the mean number of failures (50.8) for
every age input. Here are the regression and mean lines, and their respective errors:
You can see that the regression line fits the data better than the mean line, which is what we
expected (the mean line is a pretty simplistic model, after all). But can you say how much
better it is? That's exactly what R2 does! Here is the calculation.
Notice something? Most of this is the same as the calculation of RMSE. The additional parts of the calculation are the errors around the mean line and the final step of computing R2. So we have an R-squared of 0.85. Without even worrying about the units of y
we can say this is a decent model. Why? Because the model explains 85% of the variation in
the data. That's exactly what an R-squared of 0.85 tells us!
R-squared (R2) tells us the degree to which the model explains the variance in the data. In
other words, how much better it is than just predicting the mean. Here's another example.
What if our data points and regression line looked like this?
The variance around the regression line is 0. In other words, var(line) is 0. There are no errors.
Now, remember that the formula for R-squared is R² = 1 - var(line)/var(mean).
So, if we have a perfect regression line, with no errors, we get an R-squared of 1. Let's look at another example. What if our data points and regression line looked like this, with the regression line equal to the mean line?
another example. What if our data points and regression line looked like this, with the
regression line equal to the mean line?
In this case, var(line) and var(mean) are the same. So the above calculation will yield an R-
squared of 0:
So, if our regression line is only as good as the mean line, we get an R-squared of 0. What if
our regression line was really bad, worse than the mean line?
It's unlikely to get this bad! But if it does, var(mean)-var(line) will be negative, so R-squared
will be negative. An R-squared of 1 indicates a perfect fit. An R-squared of 0 indicates a
model no better or worse than the mean. An R-squared of less than 0 indicates a model
worse than just predicting the mean.
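The three metrics discussed above (MAE, RMSE, R-squared) can be computed as in the sketch below, assuming scikit-learn and using made-up actual/predicted values rather than the machine-failure data from the text.

```python
# Sketch of MAE, RMSE and R² on toy actual/predicted values (numbers are illustrative)
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([10, 20, 30, 40, 50])
y_pred = np.array([12, 18, 33, 41, 47])

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))   # square root of the mean squared error
r2 = r2_score(y_true, y_pred)                        # 1 - var(line)/var(mean), as described above

print(f"MAE={mae:.2f}  RMSE={rmse:.2f}  R2={r2:.3f}")
```

MAE and RMSE come back in the units of y, while R² is unitless, which is exactly why it is the one used to compare models across different scenarios.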
30. Identify methodology to attempt following problems and enlist general steps
involved in it.
*********************
Theory questions
reinforcement learning.)
Agent(): An entity that can perceive/explore the environment and act upon it.
Environment(): A situation in which an agent is present or surrounded by. In RL, we
assume the stochastic environment, which means it is random in nature.
Action(): Actions are the moves taken by an agent within the environment.
State(): State is a situation returned by the environment after each action taken by the
agent.
Reward(): A feedback returned to the agent from the environment to evaluate the action
of the agent.
Policy(): Policy is a strategy applied by the agent for the next action based on the
current state.
Value(): It is the expected long-term return with the discount factor, as opposed to the short-term reward.
Q-value(): It is mostly similar to the value, but it takes one additional parameter as a
current action (a).
In RL, the agent is not instructed about the environment and what actions need to be taken.
It is based on the hit-and-trial (trial-and-error) process.
The agent takes the next action and changes states according to the feedback of the previous action.
The agent may get a delayed reward.
The environment is stochastic, and the agent needs to explore it to get the maximum positive rewards.
OR
There are mainly three ways to implement reinforcement-learning in ML, which are:
Value-based: The value-based approach is about finding the optimal value function, which gives the maximum value at a state under any policy. Therefore, the agent expects the long-term return at any state(s) under policy π.
Policy-based: Policy-based approach is to find the optimal policy for the maximum
future rewards without using the value function. In this approach, the agent tries to apply
such a policy that the action performed in each step helps to maximize the future reward.
The policy-based approach has mainly two types of policy:
Deterministic: The same action is produced by the policy (π) at any state.
Stochastic: In this policy, probability determines the produced action.
Model-based: In the model-based approach, a virtual model is created for the
environment, and the agent explores that environment to learn it. There is no particular
solution or algorithm for this approach because the model representation is different for
each environment.
There are four main elements of Reinforcement Learning, which are given below:
Policy, Reward Signal, Value Function, Model of the environment
Policy: A policy can be defined as a way how an agent behaves at a given time. It maps
the perceived states of the environment to the actions taken on those states. A policy is
the core element of the RL as it alone can define the behavior of the agent. In some
cases, it may be a simple function or a lookup table, whereas, for other cases, it may
involve general computation as a search process. It could be deterministic or a stochastic
policy:
For deterministic policy: a = π(s)
For stochastic policy: π(a | s) = P[At =a | St = s]
Reward Signal: The goal of reinforcement learning is defined by the reward signal. At
each state, the environment sends an immediate signal to the learning agent, and this
signal is known as a reward signal. These rewards are given according to the good and
bad actions taken by the agent. The agent's main objective is to maximize the total
number of rewards for good actions. The reward signal can change the policy, such as if
an action selected by the agent leads to low reward, then the policy may change to
select other actions in the future.
Value Function: The value function gives information about how good the situation and
action are and how much reward an agent can expect. A reward indicates the immediate
signal for each good and bad action, whereas a value function specifies the good
state and action for the future. The value function depends on the reward as, without
reward, there could be no value. The goal of estimating values is to achieve more
rewards.
Model: The last element of reinforcement learning is the model, which mimics the
behaviour of the environment. With the help of the model, one can make inferences
about how the environment will behave. Such as, if a state and an action are given, then
a model can predict the next state and reward. The model is used for planning, which
means it provides a way to take a course of action by considering all future situations
before actually experiencing those situations. The approaches for solving the RL
problems with the help of the model are termed as the model-based approach.
Comparatively, an approach without using a model is called a model-free approach.
To understand the working process of the RL, we need to consider two main things:
Environment: It can be anything such as a room, maze, football ground, etc.
Agent: An intelligent agent such as AI robot.
Let's take an example of a maze environment that the agent needs to explore. Consider the
below image:
In the above image, the agent is at the very first block of the maze. The maze consists of an S6 block, which is a wall, S8, a fire pit, and S4, a diamond block.
The agent cannot cross the S6 block, as it is a solid wall. If the agent reaches the S4 block, it gets a +1 reward; if it reaches the fire pit, it gets a -1 reward point. It can take four actions: move up, move down, move left, and move right.
The agent can take any path to reach the final point, but it needs to do so in the fewest possible steps. Suppose the agent follows the path S9-S5-S1-S2-S3; then it will get the +1 reward point.
The agent will try to remember the preceding steps that it has taken to reach the final step. To memorize the steps, it assigns a value of 1 to each previous step. Consider the below step:
Now, the agent has successfully stored the previous steps by assigning the value 1 to each previous block. But what will the agent do if it starts from a block that has blocks of value 1 on both sides? Consider the below diagram:
It will be a difficult situation for the agent to decide whether it should go up or down, as each block has the same value. So, the above approach is not suitable for the agent to reach the destination. Hence, to solve the problem, we will use the Bellman equation, which is the main concept behind reinforcement learning.
The Bellman equation was introduced by the mathematician Richard Ernest Bellman in the year 1953, and hence it is called the Bellman equation. It is associated with dynamic programming and is used to calculate the value of a decision problem at a certain point by including the values of previous states.
It is a way of calculating the value functions in dynamic programming or environment that
leads to modern reinforcement learning.
The key-elements used in Bellman equations are:
Action performed by the agent is referred to as "a"
State occurred by performing the action is "s."
The reward/feedback obtained for each good and bad action is "R."
A discount factor is Gamma "γ."
The Bellman equation can be written as:
V(s) = max [R(s,a) + γV(s')]
Where,
V(s) = the value calculated at a particular state.
R(s,a) = the reward obtained at state s by performing action a.
γ = discount factor
V(s') = the value of the next state.
In the above equation, we are taking the max of the complete values because the agent tries
to find the optimal solution always.
So now, using the Bellman equation, we will find value at each state of the given
environment. We will start from the block, which is next to the target block.
For 1st block:
V(s3) = max [R(s,a) + γV(s')], here V(s') = 0 because there is no further state to move to.
V(s3) = max[R(s,a)] => V(s3) = max[1] => V(s3) = 1.
For 2nd block:
V(s2) = max [R(s,a) + γV(s')], here γ = 0.9 (say), V(s') = 1, and R(s,a) = 0, because there is no reward at this state.
V(s2) = max[0.9(1)] => V(s2) = max[0.9] => V(s2) = 0.9
For 3rd block:
V(s1) = max [R(s,a) + γV(s')], here γ = 0.9, V(s') = 0.9, and R(s,a) = 0, because there is no reward at this state either.
V(s1) = max[0.9(0.9)] => V(s1) = max[0.81] => V(s1) = 0.81
Now, we will move further to the 6th block, and here the agent may change its route because it always tries to find the optimal path. Let's consider the block next to the fire pit.
The agent has three options to move; if it moves to the blue box, it will feel a bump, and if it moves to the fire pit, it will get the -1 reward. But here we are considering only positive rewards, so it will move upwards only. The complete block values will be calculated using this formula. Consider the below image:
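The backward calculation above can be written out as a tiny script. This is only a sketch of the three-block path with γ = 0.9; the state names and rewards come from the example, while everything else is an illustrative assumption.

```python
# Sketch of the backward value calculation above (gamma = 0.9) along the path s1 -> s2 -> s3 -> goal
gamma = 0.9

# Immediate rewards along the path: only the step out of s3 (reaching the diamond) gives +1
rewards = {"s3": 1.0, "s2": 0.0, "s1": 0.0}

V = {}
V["s3"] = max([rewards["s3"] + gamma * 0.0])       # no further state: V(s') = 0  -> 1.0
V["s2"] = max([rewards["s2"] + gamma * V["s3"]])   # 0 + 0.9 * 1.0               -> 0.9
V["s1"] = max([rewards["s1"] + gamma * V["s2"]])   # 0 + 0.9 * 0.9               -> 0.81

print(V)   # {'s3': 1.0, 's2': 0.9, 's1': 0.81}
```

Running the same update over every block of the maze would fill in the complete set of block values referred to above.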
We can represent the agent state using the Markov State that contains all the required
information from the history. The State St is Markov state if it follows the given condition:
P[St+1 | St] = P[St+1 | S1, ..., St]
The Markov state follows the Markov property, which says that the future is independent of
the past and can only be defined with the present. The RL works on fully observable
environments, where the agent can observe the environment and act for the new state. The
complete process is known as Markov Decision process, which is explained below:
MDP is used to describe the environment for the RL, and almost all the RL problem can be
formalized using MDP.
MDP contains a tuple of four elements (S, A, Pa, Ra):
A set of finite states S
A set of finite actions A
Rewards Ra received after transitioning from state S to state S' due to action a
Transition probabilities Pa
MDP uses Markov property, and to better understand the MDP, we need to learn about it.
Markov Property: It says that "if the agent is present in the current state s1, performs an action a1 and moves to the state s2, then the state transition from s1 to s2 depends only on the current state, and future actions and states do not depend on past actions, rewards, or states." In other words, as per the Markov property, the current state transition does not depend on any past action or state. Hence, an MDP is an RL problem that satisfies the Markov property. For example, in a chess game, the players only focus on the current state and do not need to remember past actions or states.
Finite MDP: A finite MDP is when there are finite states, finite rewards, and finite actions. In
RL, we consider only the finite MDP.
Markov Process:
Markov Process is a memoryless process with a sequence of random states S1, S2, ....., St that
uses the Markov Property. Markov process is also known as Markov chain, which is a tuple (S,
P) on state S and transition function P. These two components (S and P) can define the
dynamics of the system.
Q-learning is an off-policy RL algorithm, which is used for temporal difference learning. Temporal difference learning methods are ways of comparing temporally successive predictions.
It learns the value function Q(s, a), which means how good it is to take action "a" at a particular state "s".
The below flowchart explains the working of Q- learning:
s: original state
a: Original action
r: reward observed while following the states
s' and a': New state, action pair
Deep Q Neural Network (DQN):
As the name suggests, DQN is a Q-learning using Neural networks.
For a big state space environment, it will be a challenging and complex task to define
and update a Q-table.
To solve such an issue, we can use a DQN algorithm. Where, instead of defining a Q-
table, neural network approximates the Q-values for each action and state.
Now, we will expand the Q-learning.
Q-Learning Explanation:
Q-learning is a popular model-free reinforcement learning algorithm based on the
Bellman equation.
The main objective of Q-learning is to learn the policy which can inform the agent
that what actions should be taken for maximizing the reward under what
circumstances.
It is an off-policy RL that attempts to find the best action to take at a current state.
The goal of the agent in Q-learning is to maximize the value of Q.
The value of Q-learning can be derived from the Bellman equation. Consider the Bellman
equation given below:
In the above image, we can see there is an agent who has three value options, V(s1), V(s2), V(s3). As this is an MDP, the agent only cares about the current state and the future state. The agent can go in any direction (up, left, or right), so it needs to decide where to go for the optimal path. Here the agent will take a move on a probability basis and change the state. But if we want some exact moves, we need to make some changes in terms of Q-values. Consider the below image:
Q represents the quality of the actions at each state. So instead of using a value at each state, we will use a pair of state and action, i.e., Q(s, a). The Q-value specifies which action is more lucrative than the others, and according to the best Q-value, the agent takes its next move. The Bellman equation can be used for deriving the Q-value.
To perform any action, the agent will get a reward R(s, a), and it will end up in a certain state, so the Q-value equation will be:
Q(s, a) = R(s, a) + γ max Q(s', a')
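To make the idea concrete, the sketch below runs tabular Q-learning with the usual temporal-difference update on a tiny, made-up 3-state chain environment. The environment and the hyperparameter values (α, γ, ε) are illustrative assumptions, not part of the example above.

```python
# Minimal tabular Q-learning sketch: Q(s,a) <- Q(s,a) + alpha * [R + gamma * max_a' Q(s',a') - Q(s,a)]
import random
import numpy as np

n_states, n_actions = 3, 2            # states 0,1,2 ; actions 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.2

def step(s, a):
    """Move left/right on the chain; reaching state 2 gives reward +1 and ends the episode."""
    s_next = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
    reward = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, reward, s_next == n_states - 1

for episode in range(500):
    s, done = 0, False
    while not done:
        # epsilon-greedy action selection
        a = random.randrange(n_actions) if random.random() < epsilon else int(np.argmax(Q[s]))
        s_next, r, done = step(s, a)
        # off-policy temporal-difference update toward R + gamma * max_a' Q(s', a')
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next

print(np.round(Q, 2))   # the "move right" action should end up with the higher Q-value in every state
```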
The Reinforcement Learning and Supervised Learning both are the part of machine learning,
but both types of learnings are far opposite to each other. The RL agents interact with the
environment, explore it, take action, and get rewarded. Whereas supervised learning
algorithms learn from the labeled dataset and, on the basis of the training, predict the
output. The difference table between RL and Supervised learning is given below:
Reinforcement Learning vs. Supervised Learning
RL works by interacting with the environment. Supervised learning works on an existing dataset.
The RL algorithm works the way the human brain works when making decisions. Supervised learning works the way a human learns things under the supervision of a guide.
In RL, there is no labeled dataset present. In supervised learning, a labeled dataset is present.
In RL, no previous training is provided to the learning agent. In supervised learning, training is provided to the algorithm so that it can predict the output.
RL helps to take decisions sequentially. In supervised learning, decisions are made when the input is given.
Wayve.ai has successfully applied reinforcement learning to training a car on how to drive in
a day. They used a deep reinforcement learning algorithm to tackle the lane following task.
Their network architecture was a deep network with 4 convolutional layers and 3 fully
connected layers. The example below shows the lane following task. The image in the middle
represents the driver’s perspective.
Industry automation with Reinforcement Learning
In industry, reinforcement learning-based robots are used to perform various tasks. Apart from the fact that these robots are more efficient than human beings, they can also perform tasks that would be dangerous for people.
A great example is the use of AI agents by Deepmind to cool Google Data Centers. This led
to a 40% reduction in energy spending. The centers are now fully controlled with the AI
system without the need for human intervention. There is obviously still supervision from
data center experts. The system works in the following way:
Taking snapshots of data from the data centers every five minutes and feeding this to deep neural networks
Predicting how different combinations will affect future energy consumption
Identifying actions that will lead to minimal power consumption while maintaining a set standard of safety criteria
Sending and implementing these actions at the data center
The actions are verified by the local control system.
Reinforcement learning applications in engineering
In the engineering frontier, Facebook has developed an open-source reinforcement
learning platform — Horizon. The platform uses reinforcement learning to optimize large-
scale production systems. Facebook has used Horizon internally:
to personalize suggestions
deliver more meaningful notifications to users
optimize video streaming quality.
Horizon also contains workflows for:
simulated environments
a distributed platform for data preprocessing
training and exporting models in production.
A classic example of reinforcement learning in video display is serving a user a low or high
bit rate video based on the state of the video buffers and estimates from other machine
learning systems. Horizon is capable of handling production-like concerns such as:
deploying at scale, feature normalization, distributed learning, serving and handling datasets
with high-dimensional data and thousands of feature types.
Deep learning is a branch of machine learning which is completely based on artificial neural networks; as a neural network mimics the human brain, deep learning is also a kind of mimicry of the human brain. In deep learning, we don't need to explicitly program everything. The concept of deep learning is not new; it has been around for a number of years. It is in the spotlight nowadays because earlier we did not have that much processing power or that much data. As processing power has increased exponentially over the last 20 years, deep learning and machine learning have come into the picture. A formal definition of deep learning is: deep learning is a particular kind of machine learning that achieves great power and flexibility by learning to represent the world as a nested hierarchy of concepts, with each concept defined in relation to simpler concepts, and more abstract representations computed in terms of less abstract ones. The human brain contains approximately 100 billion neurons altogether, and each neuron is connected to thousands of its neighbours. The question here is how we recreate these neurons in a computer. So, we create an artificial structure called an artificial neural net, where we have nodes or neurons. We have some neurons for input values and some for output values, and in between there may be many interconnected neurons in the hidden layers.
Deep Neural Network – It is a neural network with a certain level of complexity (having
multiple hidden layers in between input and output layers). They are capable of
modeling and processing non-linear relationships.
Deep Belief Network (DBN) – It is a class of Deep Neural Network; it is a multi-layer belief network. Steps for training a DBN:
a. Learn a layer of features from visible units using Contrastive Divergence algorithm.
b. Treat activations of previously trained features as visible units and then learn features
of features.
c. Finally, the whole DBN is trained when the learning for the final hidden layer is
achieved.
Recurrent Neural Network (performs the same task for every element of a sequence) –
Allows for parallel and sequential computation. Similar to the human brain (a large feedback network of connected neurons). They are able to remember important things about the input they received, which enables them to be more precise.
It could be 2 meters deep or 10 meters deep, but it has "depth". The same holds for our neural network: it can have 2 hidden layers or "thousands" of hidden layers (yes, you heard that correctly).
So I'd like to just stick with the question of "how deep?" for the time being.
Why are they called hidden layers?
They are called hidden because they do not see the original inputs (the training set).
For example, let's say you have a NN with an input layer, one hidden layer, and an output layer.
When asked how many layers your NN has, your answer should be "It has 2 layers", because during computation the initial, or input, layer is ignored.
Let me help you visualize how a 2-layer neural network looks.
Step by step we shall understand this image.
1) As you can see here, we have a 2-layered artificial neural network. A neural network was created to mimic the biological neurons of the human brain. In our ANN we have "k" nodes. The number of nodes is a hyperparameter, which essentially means that the amount is configured by the practitioner making the model.
2) The input and output layers do not change. We have "n" input features and 3 possible outcomes.
3) Unlike logistic regression, this neural network uses the tanh function as its activation function instead of the sigmoid function, which you are quite familiar with. The reason is that the mean of its output is closer to 0, which makes it more centered as input to the next layer. The tanh function can also introduce more non-linearity, which helps our model learn better.
Researchers tried to mimic the working of the human brain and replicated it in machines, making machines capable of thinking and solving complex problems. Deep Learning (DL) is a subset of Machine Learning (ML) that allows us to train a model using a set of inputs and then predict output based on those inputs. Like the human brain, the model consists of a set of neurons that can be grouped into 3 layers:
a) Input Layer: It receives the input and passes it to the hidden layers.
b) Hidden Layers: There can be 1 or more hidden layers in a Deep Neural Network (DNN). "Deep" in DL refers to having more than 1 hidden layer. All computations are done by the hidden layers.
c) Output Layer: This layer receives input from the last hidden layer and gives the output.
We will see how DNN works with the help of the train price prediction problem. For
simplicity, we have taken 3 inputs, namely, Departure Station, Arrival Station, Departure
Date. In this case, the input layer will have 3 neurons, one for each input. The first hidden
layer will receive input from the input layer and will start performing mathematical
computations followed by other hidden layers. The number of hidden layers and number of
neurons in each hidden layer is hyperparameters that are challenging task to decide. The
output layer will give the predicted price value. There can be more than 1 neuron in the
output layer. In our case, we have only 1 neuron as output is a single value.
Now, how is the price prediction made by the hidden layers? How is the computation done inside the hidden layers? This is explained below with the help of activation functions, loss functions, and optimizers.
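As an illustration only (the text above does not give an actual implementation), a Keras-style sketch of such a network might look as follows; the hidden-layer sizes, the relu activations, and the adam/mse choices are assumptions, and the three inputs are assumed to be numerically encoded:

# Illustrative sketch only, not the author's actual model; assumes TensorFlow/Keras.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(16, activation="relu", input_shape=(3,)),  # 3 inputs: departure station, arrival station, departure date (encoded)
    Dense(8, activation="relu"),                     # second hidden layer (size is an arbitrary assumption)
    Dense(1)                                         # single output neuron: the predicted price
])
model.compile(optimizer="adam", loss="mse")          # optimizer and loss choices are assumptions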
Each neuron has an activation function that performs its computation. Different layers can have different activation functions, but neurons belonging to one layer have the same activation function. In a DNN, a weighted sum of the inputs is calculated from the weights and inputs provided. The activation function then acts on this weighted sum and converts it into the neuron's output.
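A minimal numpy sketch of this computation for a single neuron (the numbers are made up for illustration, and the sigmoid is just one possible activation):

# One neuron: weighted sum of inputs plus bias, followed by an activation function.
import numpy as np

x = np.array([0.5, 0.2, 0.9])      # example inputs
w = np.array([0.4, -0.6, 0.1])     # example weights (normally learned)
b = 0.05                           # example bias

z = np.dot(w, x) + b               # weighted sum
a = 1.0 / (1.0 + np.exp(-z))       # sigmoid activation converts z into the neuron's output
print(z, a)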
Activation functions help the model learn the complex relationships that exist within the dataset. If we did not use an activation function in the neurons and gave the weighted sum as output, computations would be difficult, as there is no specific range for the weighted sum. So the activation function helps to keep the output in a particular range. Secondly, a non-linear activation function is always preferred, as it adds non-linearity; without it, the network would collapse into a simple linear regression model incapable of taking advantage of the hidden layers. The ReLU function or its variants is mostly used for hidden layers, and the sigmoid/softmax function is mostly used for the final layer in binary/multi-class classification problems.
To train the model, we give inputs (departure location, arrival location and departure date in the case of train price prediction) to the network and let it predict the output using the activation functions. Then we compare the predicted output with the actual output and compute the error between the two values; this error is computed using a loss/cost function. The same process is repeated for the entire training dataset and we get the average loss/error. Now, the objective is to minimize this loss to make the model accurate. There exist weights on each connection between 2 neurons. Initially, the weights are randomly initialized, and the aim is to update these weights with every iteration to reach the minimum value of the loss/cost function. We could change the weights randomly, but that is not an efficient method. Here comes the role of optimizers, which update the weights automatically.
23. What are the different loss functions and their use cases?
Once the loss for one iteration is computed, an optimizer is used to update the weights. Instead of changing the weights manually, optimizers update them automatically in small increments and help find the minimum value of the loss/cost function. That is the magic of DL! Finding the minimum value of the cost function requires iterating through the dataset many times and thus requires large computational power. The common technique used to update these weights is gradient descent.
It is used to find the minimum value of the loss function by updating the weights. There are 3 variants:
a) Batch/Vanilla Gradient Descent
In this variant, the gradient over the entire dataset is computed to perform one weight update.
It gives good results but can be slow and requires large memory.
b) Stochastic Gradient Descent (SGD)
Weights are updated for each training data point.
Frequent updates are therefore performed, which can cause the objective function to fluctuate.
c) Mini-batch Gradient Descent
It takes the best of batch gradient descent and SGD and is the algorithm of choice.
It reduces the frequency of updates and thus can lead to more stable convergence.
Choosing a proper learning rate remains difficult. (A short sketch of the three variants is given below.)
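The sketch assumes numpy and a linear model with a mean-squared-error loss purely for illustration; the point is that the three variants differ only in how many samples are used for each weight update:

# Hedged sketch: batch, stochastic and mini-batch gradient descent differ only in batch_size.
import numpy as np

def gradient(w, X, y):
    # gradient of mean squared error for a linear model y_hat = X @ w
    return 2.0 * X.T @ (X @ w - y) / len(y)

def train(X, y, batch_size, lr=0.01, epochs=10):
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(epochs):
        idx = np.random.permutation(n)          # shuffle the data each epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            w -= lr * gradient(w, X[batch], y[batch])   # one weight update
    return w

# batch_size = n        -> batch ("vanilla") gradient descent
# batch_size = 1        -> stochastic gradient descent
# batch_size = 32 (say) -> mini-batch gradient descent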
"Convolutional neural networks" indicates that these are simply neural networks with a mathematical operation called convolution (generally implemented as matrix multiplication) between their layers. The architecture was proposed by Yann LeCun in 1998 and is one of the most popular choices for image classification. A convolutional neural network can broadly be divided into these parts:
1. Input layer
2. Convolutional layer
3. Output layers
The input layer is connected to convolutional layers, which perform operations such as padding, striding and the application of kernels; because of these operations, this layer is considered the building block of convolutional neural networks. We will discuss its functioning in detail, along with how the fully connected layers work.
Convolutional Layer: The convolutional layer's main objective is to extract features from images and learn the features of the image that help in object detection. The input layer contains pixel values with some width and height; our kernels (filters) convolve over the input layer and produce feature maps that capture the features with fewer dimensions. Let's see how kernels work.
Max pooling or average pooling reduces the number of parameters and the computation required by our convolutional architecture. Here, a 2 x 2 filter with a stride of 2 is used (which is the usual choice). As the names suggest, max pooling extracts the maximum value within the filter window and average pooling takes the average within the window. We perform pooling to reduce dimensionality. Padding is added only if necessary. More convolutional layers can be added to our model until the required conditions are satisfied.
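A small numpy sketch of 2 x 2 pooling with stride 2 on a made-up 4 x 4 feature map (the values are arbitrary):

# Minimal sketch (numpy assumed): 2x2 max pooling and average pooling with stride 2.
import numpy as np

fmap = np.array([[1, 3, 2, 0],
                 [4, 6, 1, 2],
                 [7, 2, 8, 5],
                 [1, 0, 3, 4]], dtype=float)

def pool(x, size=2, stride=2, op=np.max):
    h, w = x.shape
    out = np.zeros((h // stride, w // stride))
    for i in range(0, h - size + 1, stride):
        for j in range(0, w - size + 1, stride):
            out[i // stride, j // stride] = op(x[i:i + size, j:j + size])
    return out

print(pool(fmap, op=np.max))    # max pooling keeps the largest value in each 2x2 window
print(pool(fmap, op=np.mean))   # average pooling keeps the mean of each 2x2 window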
A similar process occurs in artificial neural network architectures in deep learning. This segregation plays a key role in helping a neural network function properly, ensuring that it
learns from the useful information rather than getting stuck analyzing the not-useful part. This is also where activation functions come into the picture. An activation function helps the neural network use important information while suppressing irrelevant data points.
An Activation Function decides whether a neuron should be activated or not. This means
that it will decide whether the neuron’s input to the network is important or not in the
process of prediction using simpler mathematical operations. The role of the Activation
Function is to derive output from a set of input values fed to a node (or a layer). But—Let’s
take a step back and clarify: What exactly is a node? Well, if we compare the neural network
to our brain, a node is a replica of a neuron that receives a set of input signals—external
stimuli.
Depending on the nature and intensity of these input signals, the brain processes them and
decides whether the neuron should be activated ("fired") or not. In deep learning, this is also the role of the Activation Function, which is why it is often referred to as a Transfer Function in Artificial Neural Networks. The primary role of the Activation Function is to transform the
summed weighted input from the node into an output value to be fed to the next hidden
layer or as output.
A neural network is made of interconnected neurons; each of them is characterized by its weights, bias, and activation function.
Input Layer
The input layer takes raw input from the domain. No computation is performed at this layer.
Nodes here just pass on the information (features) to the hidden layer.
Hidden Layer
As the name suggests, the nodes of this layer are not exposed. They provide an abstraction
to the neural network.
The hidden layer performs all kinds of computation on the features entered through the
input layer and transfers the result to the output layer.
Output Layer
It’s the final layer of the network that brings the information learned through the hidden
layer and delivers the final value as a result.
Note: All hidden layers usually use the same activation function. However, the output layer
will typically use a different activation function from the hidden layers. The choice depends
on the goal or type of prediction made by the model.
Activation functions introduce an additional step at each layer during forward propagation, but their computation is worth it. Here is why: let's suppose we have a neural
network working without the activation functions. In that case, every neuron will only be
performing a linear transformation on the inputs using the weights and biases. It’s because it
doesn’t matter how many hidden layers we attach in the neural network; all layers will
behave in the same way because the composition of two linear functions is a linear function
itself. Although the neural network becomes simpler, learning any complex task is
impossible, and our model would be just a linear regression model.
3 Types of Neural Network Activation Functions
Now, as we've covered the essential concepts, let's go over the most popular neural network activation functions.
Binary Step Function: Binary step function depends on a threshold value that decides
whether a neuron should be activated or not. The input fed to the activation function is
compared to a certain threshold; if the input is greater than it, then the neuron is activated,
else it is deactivated, meaning that its output is not passed on to the next hidden layer.
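A minimal sketch of the binary step function with an assumed threshold of 0:

# Binary step activation: the neuron "fires" (output 1) only if the input exceeds the threshold.
def binary_step(z, threshold=0.0):
    return 1 if z > threshold else 0

print(binary_step(2.3))   # 1 -> activated
print(binary_step(-0.7))  # 0 -> deactivated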
Here’s why sigmoid/logistic activation function is one of the most widely used functions:
It is commonly used for models where we have to predict the probability as an
output. Since the probability of anything exists only in the range of 0 to 1, sigmoid is the right choice because of its range.
The function is differentiable and provides a smooth gradient, i.e., preventing jumps
in output values. This is represented by an S-shape of the sigmoid activation
function.
The main limitation of the sigmoid function comes from its derivative, f'(x) = sigmoid(x)*(1 - sigmoid(x)), which is at most 0.25 and approaches zero for large positive or negative inputs; this makes deep networks built on sigmoid prone to vanishing gradients. In contrast, ReLU accelerates the convergence of gradient descent towards the global minimum of the loss function due to its linear, non-saturating property.
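A short sketch (numpy assumed) of the sigmoid function, its derivative, and ReLU:

# Sigmoid, its derivative, and ReLU.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)        # f'(x) = sigmoid(x) * (1 - sigmoid(x)), at most 0.25

def relu(x):
    return np.maximum(0.0, x)   # linear for positive inputs, 0 otherwise (non-saturating)

x = np.array([-5.0, 0.0, 5.0])
print(sigmoid(x), sigmoid_derivative(x), relu(x))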
Softmax Function
Before exploring the ins and outs of the Softmax activation function, we should focus on its
building block—the sigmoid/logistic activation function that works on calculating probability
values.
Probability
The output of the sigmoid function is in the range of 0 to 1, which can be thought of as a probability. But this function faces certain problems. Suppose we have five output values of 0.8, 0.9, 0.7, 0.8, and 0.6, respectively. How can we move forward with these? The answer is: we can't. The above values don't make sense, as the sum of all the class/output probabilities should be equal to 1. The Softmax function can be described as a combination of multiple sigmoids: it calculates relative probabilities. Similar to the sigmoid/logistic activation function, the Softmax function returns the probability of each class. It is most commonly used as the activation function for the last layer of a neural network in the case of multi-class classification. Mathematically it can be represented as:
softmax(z_i) = exp(z_i) / Σ_j exp(z_j)   (for each class i, summing over all classes j)
Let’s go over a simple example together.
Assume that you have three classes, meaning that there would be three neurons in the
output layer. Now, suppose that your output from the neurons is [1.8, 0.9, 0.68]. Applying the
softmax function over these values to give a probabilistic view will result in the following
outcome: [0.58, 0.23, 0.19]. The predicted class is then the one with the largest probability: full weight goes to index 0 and no weight to index 1 and index 2, so the output is the class corresponding to the first
neuron (index 0) out of the three. You can now see how the softmax activation function makes things easy for multi-class classification problems.
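A minimal numpy sketch reproducing the example above:

# Softmax over the example outputs [1.8, 0.9, 0.68].
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # subtracting the max improves numerical stability
    return e / e.sum()

scores = np.array([1.8, 0.9, 0.68])
probs = softmax(scores)
print(probs)                # approximately [0.58, 0.23, 0.19]
print(np.argmax(probs))     # 0 -> the predicted class is the one for the first neuron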
Scaled Exponential Linear Unit (SELU)
SELU was defined in self-normalizing networks and takes care of internal normalization
which means each layer preserves the mean and variance from the previous layers. SELU
enables this normalization by adjusting the mean and variance. SELU has both positive and negative values to shift the mean, which is impossible for the ReLU activation function as it cannot output negative values. Gradients can be used to adjust the variance; for this, the activation function needs a region with a gradient larger than one.
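A minimal sketch of SELU (numpy assumed); the lambda and alpha values below are the approximate standard constants from the self-normalizing networks formulation:

# SELU: positive inputs are scaled linearly; negative inputs saturate towards -LAMBDA*ALPHA.
import numpy as np

LAMBDA = 1.0507   # approximate standard scale constant
ALPHA = 1.6733    # approximate standard alpha constant

def selu(x):
    return LAMBDA * np.where(x > 0, x, ALPHA * (np.exp(x) - 1.0))

print(selu(np.array([-2.0, 0.0, 2.0])))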
There are two challenges you might encounter when training your deep neural networks.
Let's discuss them in more detail.
Vanishing Gradients: Like the sigmoid function, certain activation functions squish an
ample input space into a small output space between 0 and 1. Therefore, a large change in
the input of the sigmoid function will cause a small change in the output. Hence, the
derivative becomes small. For shallow networks with only a few layers that use these
activations, this isn’t a big problem. However, when more layers are used, it can cause the
gradient to be too small for training to work effectively.
Exploding Gradients: Exploding gradients are problems where significant error gradients
accumulate and result in very large updates to neural network model weights during
training. An unstable network can result when there are exploding gradients, and the
learning cannot be completed. The values of the weights can also become so large as to
overflow and result in something called NaN values.
You need to match your activation function for your output layer based on the type of
prediction problem that you are solving—specifically, the type of predicted variable.
Here’s what you should keep in mind.
As a rule of thumb, you can begin with using the ReLU activation function and then move
over to other activation functions if ReLU doesn’t provide optimum results.
And here are a few other guidelines to help you out.
1. ReLU activation function should only be used in the hidden layers.
2. Sigmoid/Logistic and Tanh functions should not be used in hidden layers as they
make the model more susceptible to problems during training (due to vanishing
gradients).
3. Swish function is used in neural networks having a depth greater than 40 layers.
Finally, a few rules for choosing the activation function for your output layer based on the
type of prediction problem that you are solving:
1. Regression - Linear Activation Function
2. Binary Classification—Sigmoid/Logistic Activation Function
3. Multiclass Classification—Softmax
4. Multilabel Classification—Sigmoid
The activation function used in hidden layers is typically chosen based on the type of neural network architecture (a short sketch of these choices follows below):
1. Convolutional Neural Network (CNN): ReLU activation function.
2. Recurrent Neural Network (RNN): Tanh and/or Sigmoid activation function.
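As a hedged Keras-style sketch of the guidelines above (the unit counts are placeholders, not values from the text):

# Typical layer choices per the guidelines; assumes TensorFlow/Keras.
from tensorflow.keras.layers import Dense, SimpleRNN

# Hidden layers: ReLU for a feed-forward/CNN-style layer, tanh for a recurrent layer.
hidden_dense = Dense(64, activation="relu")
hidden_rnn   = SimpleRNN(32, activation="tanh")

# Output layers, matched to the prediction problem:
regression_out = Dense(1, activation="linear")    # regression: any real value
binary_out     = Dense(1, activation="sigmoid")   # binary classification
multiclass_out = Dense(10, activation="softmax")  # multiclass classification
multilabel_out = Dense(10, activation="sigmoid")  # multilabel classification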
Summary:
Activation Functions are used to introduce non-linearity in the network.
A neural network will almost always have the same activation function in all hidden
layers. This activation function should be differentiable so that the parameters of the
network are learned in backpropagation.
ReLU is the most commonly used activation function for hidden layers.
While selecting an activation function, you must consider the problems it might face:
vanishing and exploding gradients.
Regarding the output layer, we must always consider the expected value range of the
predictions. If it can be any numeric value (as in case of the regression problem) you
can use the linear activation function or ReLU.
Use Softmax or Sigmoid function for the classification problems.
In a word, accuracy. Deep learning achieves recognition accuracy at higher levels than ever
before. This helps consumer electronics meet user expectations, and it is crucial for safety-
critical applications like driverless cars. Recent advances in deep learning have improved to
the point where deep learning outperforms humans in some tasks like classifying objects in
images. While deep learning was first theorized in the 1980s, there are two main reasons it
has only recently become useful:
1. Deep learning requires large amounts of labeled data. For example, driverless car
development requires millions of images and thousands of hours of video.
2. Deep learning requires substantial computing power. High-performance GPUs have
a parallel architecture that is efficient for deep learning. When combined with
clusters or cloud computing, this enables development teams to reduce training time
for a deep learning network from weeks to hours or less.
Deep learning applications are used in industries from automated driving to medical devices.
Automated Driving: Automotive researchers are using deep learning to automatically
detect objects such as stop signs and traffic lights. In addition, deep learning is used to
detect pedestrians, which helps decrease accidents.
Aerospace and Defense: Deep learning is used to identify objects from satellites that locate
areas of interest, and identify safe or unsafe zones for troops.
Medical Research: Cancer researchers are using deep learning to automatically detect
cancer cells. Teams at UCLA built an advanced microscope that yields a high-dimensional
data set used to train a deep learning application to accurately identify cancer cells.
Industrial Automation: Deep learning is helping to improve worker safety around heavy
machinery by automatically detecting when people or objects are within an unsafe distance
of machines.
Electronics: Deep learning is being used in automated hearing and speech translation. For
example, home assistance devices that respond to your voice and know your preferences are
powered by deep learning applications.
37. What's the Difference Between Machine Learning and Deep Learning?
Deep learning is a specialized form of machine learning. A machine learning workflow starts
with relevant features being manually extracted from images. The features are then used to
create a model that categorizes the objects in the image. With a deep learning workflow,
relevant features are automatically extracted from images. In addition, deep learning
performs "end-to-end learning", where a network is given raw data and a task to perform,
such as classification, and it learns how to do this automatically. Another key difference is
deep learning algorithms scale with data, whereas shallow learning converges. Shallow
learning refers to machine learning methods that plateau at a certain level of performance
when you add more examples and training data to the network. A key advantage of deep
learning networks is that they often continue to improve as the size of your data increases. In
machine learning, you manually choose features and a classifier to sort images. With deep
learning, feature extraction and modeling steps are automatic.
https://in.mathworks.com/videos/ai-for-engineers-building-an-ai-system-1603356830725.html
https://in.mathworks.com/videos/object-recognition-deep-learning-and-machine-learning-for-
computer-vision-121144.html
https://in.mathworks.com/videos/introduction-to-deep-learning-what-are-convolutional-neural-
networks--1489512765771.html
Figure: Comparing a machine learning approach to categorizing vehicles (left) with deep
learning (right).
Machine learning offers a variety of techniques and models you can choose based on
your application, the size of data you're processing, and the type of problem you
want to solve.
A successful deep learning application requires a very large amount of data
(thousands of images) to train the model, as well as GPUs, or graphics processing
units, to rapidly process your data.
When choosing between machine learning and deep learning, consider whether you
have a high-performance GPU and lots of labeled data.
If you don’t have either of those things, it may make more sense to use machine
learning instead of deep learning.
Deep learning is generally more complex, so you’ll need at least a few thousand
images to get reliable results.
Having a high-performance GPU means the model will take less time to analyze all
those images.
Reinforcement learning provides the means for robots to learn complex behaviour from interaction, on the basis of generalizable behavioural primitives. From negative human feedback, the robot learns from its own misconduct.
*********************
Unit 6: Applications
Third Year Bachelor of Engineering (Choice Based Credit System)
Mechanical Engineering (2019 Course)
Board of Studies – Mechanical and Automobile Engineering, SPPU, Pune
(With Effect from Academic Year 2021-22)
Unit 6: Applications
Syllabus:
Contents (Theory):
Human Machine Interaction
Predictive Maintenance and Health Management
Fault Detection
Dynamic System Order Reduction
Image based part classification
Process Optimization
Material Inspection
Tuning of control algorithms
HMI is all about how people and automated systems interact and communicate with
each other. That has long ceased to be confined to just traditional machines in industry
and now also relates to computers, digital systems or devices for the Internet of Things
(IoT).
More and more devices are connected and automatically carry out tasks. Operating all of
these machines, systems and devices needs to be intuitive and must not place excessive
demands on users.
Human-machine interaction is all about how people and automated systems interact
with each other.
HMI now plays a major role in industry and everyday life: More and more devices are
connected and automatically carry out tasks.
A user interface that is as intuitive as possible is therefore needed to enable smooth
operation of these machines. That can take very different forms.
Smooth communication between people and machines requires interfaces: The place
where or action by which a user engages with the machine.
Simple examples are light switches or the pedals and steering wheel in a car: An action is
triggered when you flick a switch, turn the steering wheel or step on a pedal.
However, a system can also be controlled by text being keyed in, a mouse, touch screens,
voice or gestures.
The devices are either controlled directly: Users touch the smartphone’s screen or issue a
verbal command. Or the systems automatically identify what people want: Traffic lights
change color on their own when a vehicle drives over the inductive loop in the road’s
surface.
Other technologies are not so much there to control devices, but rather to complement
our sensory organs. One example of that is a virtual reality glass.
There are also digital assistants: Chatbots, for instance, reply automatically to requests
from customers and keep on learning.
User interfaces in HMI are the places where or actions by which the user engages with
the machine.
A system can be operated by means of buttons, a mouse, touch screens, voice or
gesture, for instance.
One simple example is a light switch, the interface between the machine "light" and a
human being.
It is also possible to differentiate further between direct control, such as tapping a touch
screen, and automatic control.
In the latter case, the system itself identifies what people want.
Think of traffic lights which change color as soon as a vehicle drives over the inductive
loop in the road’s surface.
For a long time, machines were mainly controlled by switches, levers, steering wheels or
buttons; these were joined later by the keyboard and mouse.
Now we are in the age of the touch screen. Body sensors in wearables that automatically
collect data are also modern interfaces.
Voice control is also making rapid advances: Users can already control digital assistants,
such as Amazon Alexa or Google Assistant, by voice.
That entails far less effort. Chatbots are also used in such systems and their ability to
communicate with people is improving more and more thanks to artificial intelligence.
Gesture control is at least as intuitive as voice control. That means robovacs, for example,
could be stopped by a simple hand signal in the future.
Google and Infineon have already developed a new type of gesture control by the name
of "Soli":
Devices can also be operated in the dark or remotely with the aid of radar technology.
Technologies that augment reality now also act as an interface. Virtual reality glasses
immerse users in an artificially created 3D world, while augmented reality glasses
superimpose virtual elements in the real environment.
Mixed reality glasses combine both technologies, thus enabling scenarios to be
presented realistically thanks to their high resolution.
Modern HMI helps people to use even very complex systems with ease. Machines also
keep on getting better at interpreting signals – and that is important in particular in
autonomous driving.
Human needs are identified even more accurately, which means robots can be used in
caring for people, for instance. One potential risk is the fact that hackers might obtain
information on users via the machines’ sensors.
Last but not least, security is vital in human-machine interaction. Some critics also fear
that self-learning machines may become a risk by taking actions autonomously.
It is also necessary to clarify the question of who is liable for accidents caused by HMI.
Whether voice and gesture control or virtual, augmented and mixed reality, HMI
interaction is far from reaching the end of the line.
In future, data from different sensors will also increasingly be combined to capture and
control complex processes optimally.
The human senses will be replicated better and better with the aid of, for example, gas
sensors, 3D cameras and pressure sensors, thus expanding the devices’ capabilities.
In contrast, there will be fewer of the input devices that are customary at present, such as
remote controllers.
Even complex systems will become easier to use thanks to modern human-machine
interaction. To enable that, machines will adapt more and more to human habits and
needs. Virtual reality, augmented reality and mixed reality will also allow them to be
controlled remotely. As a result, humans expand their realm of experience and field of
action.
Machines will also keep on getting better at interpreting signals in future – and that’s
also necessary: The fully autonomous car must respond correctly to hand signals from a
police officer at an intersection. Robots used in care must likewise be able to "assess" the
needs of people who are unable to express these themselves.
The more complex the contribution made by machines is, the more important it is to
have efficient communication between them and users. Does the technology also
understand the command as it was meant? If not, there’s the risk of misunderstandings –
and the system won’t work as it should. The upshot: A machine produces parts that don’t
fit, for example, or the connected car strays off the road.
People, with their abilities and limitations, must always be taken into account in the
development of interfaces and sensors. Operating a machine must not be overly complex
or require too much familiarization. Smooth communication between human and
machine also needs the shortest possible response time between command and action,
otherwise users won’t perceive the interaction as being natural.
One potential risk arises from the fact that machines are highly dependent on sensors to
be controlled or respond automatically. If hackers have access to the data, they obtain
details of the user’s actions and interests. Some critics also fear that even learning
machines might act autonomously and subjugate people. One question that has also not
been clarified so far is who is liable for accidents caused by errors in human-machine
interaction, and who is responsible for them.
Reference: https://www.infineon.com/cms/en/discoveries/human-machine-interaction/
8. Make a list of types of maintenance and explain them in brief. Discuss the scope of AI/ML.
9. Explain fault diagnosis (of any suitable machine element) using ML.
Refer to the following article and explain the procedure the authors have adopted.
Sakthivel, N. R., Sugumaran, V., & Babudevasenapati, S. (2010). Vibration based fault
diagnosis of monoblock centrifugal pump using decision tree. Expert Systems with
Applications, 37(6), 4040-4049.
https://www.sciencedirect.com/science/article/pii/S0957417409008689
10. Explain an intelligent approach for the classification of Nuts, Bolts, Washers and Locating Pins.
An intelligent approach to classify Nuts, Bolts, Washers and Locating Pins (treated much like the classic cats-vs-dogs image classification problem) is explained here.
Figure: A flowchart of a Machine Learning algorithm trained on images of Nuts and Bolts using a Neural Network Model.
Data-set
We downloaded 238 parts each of the 4 classes (Total 238 x 4 = 952) from various part
libraries available on the internet. Then we took 8 different isometric images of each part.
This was done to augment the data available, as only 238 images per class would not be enough to train a good neural network. A single class now has 1904 images (8 isometric images of 238 parts), for a total of 7616 images. Each image is 224 x 224 pixels.
Figure: Images of the 4 classes; one part has 8 images.
Each image is treated as a single data point. We then have labels with the numbers 0, 1, 2, 3; each number corresponds to a particular image and indicates which class it belongs to:
# Integers and their corresponding classes
{0: 'locatingpin', 1: 'washer', 2: 'bolt', 3: 'nut'}
After training on the above images, we will see how well our model predicts a random image it has not seen.
Methodology
The process took place in 7 steps. We will get to the details later. The brief summary is:
1. Data Collection: The data for each class was collected from various standard part libraries on the internet.
2. Data Preparation: 8 isometric view screenshots were taken of each part and reduced to 224 x 224 pixels.
3. Model Selection: A sequential CNN model was selected as it is simple and well suited for image classification.
4. Train the Model: The model was trained on our data of 7616 images with an 80/20 train-test split.
5. Evaluate the Model: The results of the model were evaluated: how well did it predict the classes?
6. Hyperparameter Tuning: This process is done to tune the hyperparameters to get better results. We have already tuned our model in this case.
7. Make Predictions: Check how well the model predicts real-world data.
Data Collection
We downloaded the part data of various nuts and bolts from the different part libraries on
the internet. These websites have numerous 3D models for standard parts from various
makers in different file formats. Since we will be using the FreeCAD API to extract the images, we downloaded the files in a neutral format (STEP).
Data Preparation
Then we ran a program using FreeCAD API that automatically took 8 isometric screenshots
of 224 x 224 pixels of each part. FreeCAD is a free and open-source general-purpose
parametric 3D computer-aided design modeler which is written in Python.
Figure: A convolutional neural network; a basic visualization of how our algorithm will work.
The following is what our CNN looks like. Don't worry if you don't understand it all. The idea is that the 224 x 224 pixels from each of our data points will go through this network and produce an answer. The model adjusts its weights accordingly and, after many iterations, will be able to predict a random image's class.
#Model description
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
conv2d_1 (Conv2D)            (None, 222, 222, 128)     1280
_________________________________________________________________
activation_1 (Activation)    (None, 222, 222, 128)     0
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 111, 111, 128)     0
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 109, 109, 128)     147584
_________________________________________________________________
activation_2 (Activation)    (None, 109, 109, 128)     0
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 54, 54, 128)       0
_________________________________________________________________
flatten_1 (Flatten)          (None, 373248)            0
_________________________________________________________________
dense_1 (Dense)              (None, 64)                23887936
_________________________________________________________________
dense_2 (Dense)              (None, 4)                 260
_________________________________________________________________
activation_3 (Activation)    (None, 4)                 0
=================================================================
Total params: 24,037,060
Trainable params: 24,037,060
Non-trainable params: 0
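As a hedged reconstruction only (the summary above does not show the activation types, so the relu and softmax choices below are assumptions, and the single input channel is inferred from the 1,280 parameters of the first convolution), the model could be defined in Keras roughly as:

# Hedged reconstruction of the summary above; assumes TensorFlow/Keras.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, Activation, MaxPooling2D, Flatten, Dense

model = Sequential([
    Conv2D(128, (3, 3), input_shape=(224, 224, 1)),  # 1 input channel inferred from the 1,280 params
    Activation("relu"),                              # activation type is an assumption
    MaxPooling2D((2, 2)),
    Conv2D(128, (3, 3)),
    Activation("relu"),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(64),                                       # the summary shows no explicit activation after this layer
    Dense(4),
    Activation("softmax"),                           # 4 classes: locatingpin, washer, bolt, nut
])
model.summary()   # should reproduce roughly the shapes and parameter counts listed above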
Model Training
Now the time has finally come to train the model using our dataset of 7616 images. Our [X] is a 3D array of 7616 x 224 x 224 and our [y] label set is a 7616 x 1 array. For training purposes, the data must be split into at least two parts: a training set and a validation (test) set (test and validation are used interchangeably when only 2 sets are involved).
The validation data usually comes from the same distribution as the training set and is the
data the model has not seen. After the model has trained from the training set, it will try to
predict the data of the validation set. How accurately it predicts this, is our validation
accuracy. This is more important than the training accuracy. It shows how well the model
generalizes. In real life application it is common to split it even into three parts. Train,
Validation and Test. For our case we will only split it into a training and test set. It will be a
80–20 split. 80 % of the images will be used for training and 20% will be used for testing.
That is train on 6092 samples, test on 1524 samples from the total 7616.
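For reference, a hedged scikit-learn sketch of such an 80/20 split (the arrays below are placeholders with the shapes described above, not the real dataset):

# 80/20 train-test split; assumes numpy and scikit-learn.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.zeros((7616, 224, 224), dtype=np.uint8)   # placeholder for the 7616 images
y = np.zeros(7616, dtype=int)                    # placeholder for the 7616 integer labels (0-3)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(len(X_train), len(X_test))                 # 6092 training samples, 1524 test samples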
If the algorithm predicts incorrectly, the cost increases; if it predicts correctly, the cost decreases.
After training for 15 epochs we can see the following graph of loss and accuracy. (Cost and
loss can be used interchangeably for our case)
Figure: Graph generated with matplotlib showing training and validation loss for our model.
The loss decreased as the model trained more times. It becomes better at classifying the
images with each epoch. The model is not able to improve the performance much on the
validation set.
Figure: Graph generated with matplotlib showing training and validation accuracy for our model.
The accuracy increased as the model trained with each epoch; it becomes better at classifying the images. The accuracy for the validation set is lower than for the training set, as the model has not trained on it directly. The final value is 97.64%, which is not bad.
Hyperparameter Tuning
The next step would be to change the hyperparameters (the learning rate, number of epochs, data size, etc.) to improve our model. In machine learning, a hyperparameter is a parameter whose value is used to control the learning process. By contrast, the values of other parameters (typically node weights) are derived via training. For our purposes, we had already modified these parameters before this article was written so as to obtain optimum performance for this article. We increased the dataset size and the number of epochs to improve the accuracy.
catalogue. There are serial codes to remember as a change in a single digit or alphabet
might mean a different type of part.
Reference: Five High-Impact Research Areas in Machine Learning for Materials Science by
Bryce Meredig. https://pubs.acs.org/doi/10.1021/acs.chemmater.9b04078
Over the past several years, the field of materials informatics has grown dramatically. (1)
Applications of machine learning (ML) and artificial intelligence (AI) to materials science are
now commonplace. As materials informatics has matured from a niche area of research into
an established discipline, distinct frontiers of this discipline have come into focus, and best
practices for applying ML to materials are emerging. (2) The purpose of this editorial is to
outline five broad categories of research that, in my view, represent particularly high-impact
opportunities in materials informatics today:
Validation by experiment or physics-based simulation. One of the most common
applications of ML in materials science involves training models to predict materials
properties, typically with the goal of discovering new materials. With the availability of
user-friendly, open-source ML packages such as scikit-learn, (3) keras, (4) and pytorch, (5)
the process of training a model on a materials data set—which requires only a few lines
of python code—has become completely commoditized. Thus, standard practice in
designing materials with ML should include some form of validation, ideally by
experiment (6−8) or, in some cases, by physics-based simulation. (9,10) Of particular
interest are cases in which researchers use ML to identify materials whose properties are
superior to those of any material in the initial training set; (11) such extraordinary results
remain scarce.
ML approaches tailored for materials data and applications. This category
encapsulates a diverse set of method development activities that make ML more
applicable to and effective for a wider range of materials problems. Materials science as a
field is characterized by small, sparse, noisy, multiscale, and heterogeneous
multidimensional (e.g., a blend of scalar property estimates, curves, images, time series,
etc.) data sets. At the same time, we are often interested in exploring very large, high-
dimensional chemistry and processing design spaces. Some method development
examples to address these challenges include new approaches for uncertainty
quantification (UQ), (12) extrapolation detection, (13) multiproperty optimization, (14)
descriptor development (i.e., the design of new materials representations for ML),
(15−17) materials-specific cross-validation, (18,19) ML-oriented data standards, (20,21)
and generative models for materials design. (22)
High-throughput data acquisition capabilities. ML is notoriously data-hungry. Given
the typically very high cost of acquiring materials data, both in terms of time and money,
the materials informatics field is well-served by research that accelerates and
democratizes our ability to synthesize, characterize, and simulate materials. Examples
include high-throughput density functional theory calculations of materials properties,
(23−25) applications of robotics, automation, and operations research to materials
science, (26−30) and natural language processing (NLP) to extract materials data from
text corpora. (31,32)
ML that makes us better scientists. A popular refrain in the materials informatics
community is that "ML will not replace scientists, but scientists who use ML will replace those who do not." This bon mot suggests that ML has the potential to make scientists
more effective and enable them to do more interesting and impactful work. We are still
in the nascent stages of creating true ML-based copilots for scientists, but research areas
such as ML model explainability and interpretability (33,34) represent a valuable early
step. Another example is the application of ML to accelerate or simplify materials
characterization. Researchers have used deep learning to efficiently post-process and
understand images generated via existing characterization methods such as scanning
transmission electron microscopy (STEM) (35) and position averaged convergent beam
electron diffraction (PACBED). (36)
Integration of physics within ML, and ML with physics-based simulations. The
paucity of data in many materials applications is a strong motivator for formally
integrating known physics into ML models. One approach to embedding physics within
ML is to develop methods that guarantee certain desirable properties by construction,
such as respecting the invariances present in a physical system. (37) Another strategy is
to use ML to model the difference between simulation outputs and experimental results.
For example, Google and collaborators created TossingBot, a robotic system that learned
to throw objects into bins with the aid of a ballistics simulation. (38) The researchers
found that a physics-aware ML approach, wherein ML learned and corrected for the
discrepancy between the simulations and real-world observations, dramatically
outperformed a pure trial-and-error ML training strategy. In a similar vein, ML can enable
us to derive more value from existing physics-based simulations. For example, ML-based
interatomic potentials (39−41) represent a means of capturing some of the physics of
first-principles simulations in a much more computationally efficient model that can
simulate orders of magnitude more atoms. ML can also serve as "glue" to link physics-
based models operating at various fidelities and length scales. (42)
As ML becomes more widely used in materials research, I expect that efforts addressing one
or more of these five themes will have an outsized impact on both the materials informatics
discipline and materials science more broadly.
Reference: Vasudevan, R., Pilania, G., & Balachandran, P. V. (2021). Machine learning for
materials design and discovery. Journal of Applied Physics, 129(7), 070401.
https://doi.org/10.1063/5.0043300
Liu, Y., Zhao, T., Ju, W., & Shi, S. (2017). Materials discovery and design using machine
learning. Journal of Materiomics, 3(3), 159-177.
https://www.sciencedirect.com/science/article/pii/S2352847817300515
Figure: The fundamental framework for the application of machine learning in material property prediction.
*********************