100 Machine Learning Interview Questions and Answers
Answers
1. Please Explain Machine Learning, Artificial Intelligence, And Deep Learning?
Machine learning is a subset of Artificial Intelligence; it comprises the techniques that enable computers to learn from data and deliver Artificial Intelligence applications. Artificial Intelligence (AI) is a branch of computer science that is mainly focused on building smart machines that can perform tasks that normally require human intelligence. It is the endeavor to replicate or simulate human intelligence in machines.
Deep learning can be defined as a class of machine learning algorithms in Artificial Intelligence that uses multiple layers to progressively extract higher-level features from the given raw input.
Machine Learning is a huge field and comprises many topics. It can therefore take six months or more to learn Machine Learning, even if you spend 6-7 hours per day. If you have good hands-on mathematical and analytical skills, six months will usually be sufficient.
A Kernel Trick is a method where non-linear data is projected onto a higher-dimensional space so that it becomes easier to classify the data, as it can then be linearly divided by a plane (hyperplane).
1. Holdout Method: This technique works by holding out a part of the training data set and sending it to the model that was trained on the remaining data set to get the required predictions.
2. K-Fold Cross-Validation: Here, the data is divided into k subsets so that every time, one among the k subsets is used as a validation set, and the other k-1 subsets are used as the training set (a k-fold sketch follows this list).
3. Stratified K-Fold Cross-Validation: A variation of k-fold that preserves the class proportions in each fold, so it works well on imbalanced data.
4. Leave-P-Out Cross-Validation: Here, we leave p data points out of the n data points in the training data, then we use the n-p samples to train the model and the p points as the validation set.
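As an illustration of the k-fold approach above, here is a minimal sketch using scikit-learn; the synthetic data set, the logistic regression model, and the parameter values are assumptions made only for this example.

```python
# A minimal sketch of 5-fold cross-validation (all names below are illustrative).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000)

# Each of the 5 folds is used exactly once as the validation set.
scores = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```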
Bagging: It is a method that merges the same type of predictions. It decreases the variance, not the bias. Each model receives equal weight.
Boosting: It is a method that merges different types of predictions. It decreases the bias, not the variance. Models are weighted based on their performance.
6. What Are Kernels In SVM? Can You List Some Popular Kernels Used In SVM?
A kernel is basically a mathematical function used in the Support Vector Machine that provides a window to manipulate the data. A kernel function is used to transform the training data so that a non-linear decision surface becomes a linear equation in a higher-dimensional space (see the sketch after the list below).
Some of the popular kernels used in SVM are:
1. Polynomial kernel
2. Gaussian kernel
3. Gaussian radial basis function (RBF)
4. Laplace RBF kernel
5. Hyperbolic tangent kernel
6. Sigmoid kernel
7. Bessel function of the first kind Kernel
8. ANOVA radial basis kernel
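To make the idea concrete, the following minimal sketch fits SVMs with a few of these kernels using scikit-learn; the two-moons data set and the default hyperparameters are assumptions for illustration only.

```python
# A minimal sketch comparing SVM kernels on a non-linear toy data set.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ("linear", "poly", "rbf", "sigmoid"):
    clf = SVC(kernel=kernel).fit(X_train, y_train)
    print(kernel, "test accuracy:", clf.score(X_test, y_test))
```

On this kind of data the RBF kernel typically separates the classes better than the linear kernel, which is exactly the point of projecting the data into a higher-dimensional space.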
An out-of-bag error, called OOB error and also known as an out-of-bag estimate, is a technique to measure the prediction error of random forests and boosted decision trees. Bagging mainly uses subsampling with replacement to create the training samples for the model to learn from.
Variance inflation factor, known as VIF, is a measure of the amount of multicollinearity in a given set of multiple regression variables. The ratio is calculated for each of the independent variables. A high VIF means that the associated independent variable is highly collinear with the other variables in the model.
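A minimal sketch of computing VIF values, assuming the statsmodels and pandas libraries are available; the random features are made up so that one of them is nearly collinear with another.

```python
# Compute VIF for each feature; a high value flags multicollinearity.
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = 2 * x1 + rng.normal(scale=0.1, size=100)   # nearly collinear with x1
x3 = rng.normal(size=100)
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

for i, col in enumerate(X.columns):
    print(col, "VIF:", variance_inflation_factor(X.values, i))
```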
Support Vector Machine, known as SVM, is one of the most commonly used Supervised Learning algorithms and is used for both Classification and Regression problems, although it is primarily used for Classification problems in Machine Learning. The main aim of the SVM algorithm is to create the best decision boundary, which segregates the n-dimensional space into classes so that newly obtained data points can easily be put in the correct category in the future.
Supervised learning: Here, the model needs to find the mapping function that maps the input variable (X) to the output variable (Y).
Unsupervised learning: The main aim is to find the structure and patterns in the given input data.
Precision, also known as positive predictive value, is defined as the fraction of relevant instances among the retrieved instances.
Precision = TP / (TP + FP)
where TP is True Positive and FP is False Positive.
Recall, also known as sensitivity, is defined as the fraction of relevant instances that were retrieved.
Recall = TP / (TP + FN)
where TP is True Positive and FN is False Negative.
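A minimal sketch of both formulas, computed by hand from the counts and, for comparison, with scikit-learn (assumed available); the labels are made up for the example.

```python
from sklearn.metrics import precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives

print("Precision (manual):", tp / (tp + fp))
print("Recall    (manual):", tp / (tp + fn))
print("Precision (sklearn):", precision_score(y_true, y_pred))
print("Recall    (sklearn):", recall_score(y_true, y_pred))
```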
Lasso Regression: A regression model that makes use of the L1 regularization process. It adds the absolute value of the magnitude of the coefficients as a penalty term to the loss function. It tries to estimate the median of the data.
Ridge Regression: A regression model that makes use of the L2 regularization process. It adds the squared magnitude of the coefficients as a penalty term to the loss function. It tries to estimate the mean of the data.
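A minimal sketch contrasting the two penalties with scikit-learn (assumed available); the synthetic regression data and the alpha value are illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty

# Lasso tends to drive uninformative coefficients exactly to zero,
# whereas Ridge only shrinks them toward zero.
print("Lasso coefficients:", lasso.coef_.round(2))
print("Ridge coefficients:", ridge.coef_.round(2))
```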
The Fourier transform is a way to split something up into a bunch of sine waves. In terms of mathematics, the Fourier Transform is a process that can decompose a signal into its constituent components and frequencies. The Fourier transform is used in many fields, such as signal processing, radio, and acoustics.
The F1-score combines both the precision and recall of a classifier into one single metric by
taking the harmonic mean. It is used to compare the performances of two classifiers. For
example, classifier X has a higher recall, and classifier Y has higher precision. Now the
F1-scores calculated for both the classifiers will be used to predict which one produces the
better results.
The F1 score can be calculated as
2(P*R)/(P+R)
Where P is the precision.
R is the Recall of the classification model.
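A minimal sketch of the F1 calculation, both directly from precision and recall and from labels via scikit-learn (assumed available); all values are made up for the example.

```python
from sklearn.metrics import f1_score

def f1(precision, recall):
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

print(f1(0.75, 0.60))              # direct from P and R

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
print(f1_score(y_true, y_pred))    # computed from labels
```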
False Rejection: There can be a rejection even with an authorized match.
False Acceptance: There can be an acceptance even with an unauthorized match.
Machine Learning: Here, algorithms are largely self-directed, based on the data analysis.
Data Mining: Patterns are identified by the data analysts.
AI (Artificial Intelligence) refers to the simulation of human intelligence in machines that are programmed to think like humans and imitate their actions.
Examples: Face Detection and Recognition, Google Maps, Ride-Hailing Applications, and E-Payments.
21. How To Select Important Variables While Working On A Data Set?
1. You have to remove the correlated variables before selecting important variables.
2. Make use of linear regression and select the variables based on their p values.
3. Use Forward Selection, Stepwise Selection, and Backward Selection.
4. Use Random Forest, Xgboost, and plot variable importance chart
5. Use the Lasso Regression
6. You have to select top n features by measuring the information gain for the available set
of features.
Causality explicitly applies to cases where action A causes the outcome of action B.
Correlation can simply be defined as a relationship, where the actions of A relate to the actions of B, but it is not necessary for one event to cause the other event to happen.
Overfitting is a type of modeling error in which the model fits the existing data too closely and consequently fails to predict or fit future observations and additional data effectively.
A standard deviation is a number that specifies how spread out the values are. A low standard deviation means that most of the numbers are close to the mean value. A higher standard deviation means that the values are spread out over a wider range.
Variance in Machine Learning is a type of error that occurs due to the model’s sensitivity to
small fluctuations in the given training set.
A Multilayer Perceptron (MLP) is defined as a class of artificial neural networks that can
generate a set of outputs from the set of given inputs. An MLP consists of several layers of input
nodes that are connected as a directed graph between input and output layers.
The main purpose of the Boltzmann Machine is to optimize the solution to a given problem. It is
mainly used to optimize the weights and quantity related to that specified problem.
Classification: Here, the data is labeled into one or multiple classes. It can be evaluated using accuracy.
Regression: Here, you need to predict a continuous quantity. It can be evaluated using the root mean squared error.
In the field of machine learning, a confusion matrix also called an error matrix, is defined as a
specific table layout that allows the user to visualize the performance of an algorithm, mainly a
supervised learning one.
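A minimal sketch of building a confusion matrix with scikit-learn (assumed available); the spam/ham labels are made up for the example.

```python
from sklearn.metrics import confusion_matrix

y_true = ["spam", "ham", "spam", "ham", "spam", "ham"]
y_pred = ["spam", "ham", "ham",  "ham", "spam", "spam"]

# Rows are the true classes, columns the predicted classes.
print(confusion_matrix(y_true, y_pred, labels=["spam", "ham"]))
```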
30. When Your Dataset Is Suffering From High Variance, How Would You Handle It?
For datasets with high variance, we can make use of the bagging algorithm. The bagging algorithm splits the data into subgroups by sampling with replacement from the original data. Once the data is split, each random sample is used to train a model with a training algorithm. Then we make use of a polling (voting) technique to combine all the predicted outcomes of the models.
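A minimal sketch of bagging a high-variance decision tree with scikit-learn's BaggingClassifier (assumed available); the synthetic data and the number of estimators are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

single_tree = DecisionTreeClassifier(random_state=0)
bagged_trees = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                                 random_state=0)

# Bagging usually improves on the single tree by reducing variance.
print("Single tree CV accuracy:", cross_val_score(single_tree, X, y, cv=5).mean())
print("Bagged trees CV accuracy:", cross_val_score(bagged_trees, X, y, cv=5).mean())
```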
Inductive research: It moves from specific observations to broad generalizations.
Deductive research: It starts from a theory; if there is no theory, you cannot conduct deductive research.
33. Which Among These Is More Important: Model Accuracy Or Model Performance?
Model accuracy is considered an important characteristic of a Machine Learning/AI model. Whenever we discuss the performance of the model, we first clarify whether it is the model scoring performance or the model training performance.
Model performance can be improved by using distributed computing and parallelizing over the scored assets, but accuracy needs to be carefully built up during the model training process.
A time series in Machine Learning is defined as a set of random variables that are ordered with respect to time. Time series are studied to interpret a phenomenon, identify components such as trend and cyclicity, and predict future values.
The Information Gain is defined as the amount of information gained about a signal or random
variable from observing another random variable.
Entropy can be defined as the average rate at which information is produced by the stochastic
source of data, Or it can be defined as a measure of the uncertainty that is associated with a
random variable.
36. Differentiate Between Stochastic Gradient Descent (SGD) And Gradient Descent (GD)?
Batch Gradient Descent involves calculations over the full training set at each step, which makes it very slow on very large training data. Hence, it becomes very expensive to do Batch GD. However, it is great for relatively smooth error manifolds, and it scales well with the number of features.
Stochastic Gradient Descent tries to solve the primary problem in Batch Gradient Descent, which is the usage of the entire training data to calculate the gradients at each step. SGD is stochastic in nature, meaning it picks up a "random" instance of the training data at each step and then computes the gradient, making it faster as there is very little data to manipulate at one time.
Batch Gradient Descent: It computes the gradient using the entire training sample. It is not suggested for huge training samples.
Stochastic Gradient Descent: It computes the gradient using a single training sample. It can be suggested for large training samples.
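A minimal NumPy sketch of one parameter update under each approach for linear regression; the synthetic data and the learning rate are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=100)

w = np.zeros(3)
lr = 0.1

# Batch GD: the gradient uses the entire training set.
grad_batch = -2 * X.T @ (y - X @ w) / len(y)
w_batch = w - lr * grad_batch

# SGD: the gradient uses a single randomly chosen training sample.
i = rng.integers(len(y))
grad_sgd = -2 * X[i] * (y[i] - X[i] @ w)
w_sgd = w - lr * grad_sgd

print("Batch GD step:", w_batch)
print("SGD step:     ", w_sgd)
```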
Gini impurity: It has values inside the interval [0, 0.5].
Entropy: It has values inside the interval [0, 1].
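A minimal sketch of both impurity measures for a binary split, showing why Gini stays within [0, 0.5] while entropy (in bits) stays within [0, 1]; the probabilities are made up.

```python
import numpy as np

def gini(p):
    # Gini impurity for a two-class node with positive-class probability p.
    return 1 - (p**2 + (1 - p)**2)

def entropy(p):
    # Entropy in bits; defined as 0 at the pure nodes p = 0 and p = 1.
    if p in (0.0, 1.0):
        return 0.0
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(f"p={p}: gini={gini(p):.3f}, entropy={entropy(p):.3f}")
```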
Ensemble methods are the techniques used to create multiple models and combine them to
produce enhanced results. Ensemble methods usually produce more precise solutions than a
single model would.
In Ensemble Learning, we divide the training data set into multiple subsets, where each subset
is then used to build a separate model. Once the models are trained, they are then combined to
predict an outcome in such a way that there is a reduction in the variance of the output.
Multicollinearity occurs when multiple independent variables are highly correlated with each
other in a regression model, which means that an independent variable can be predicted from
another independent variable inside a regression model.
Collinearity mainly occurs when two predictor variables in a multiple regression have some
correlation.
Like random forests, gradient boosting is also a set of decision trees. The two primary differences are listed below (see the sketch after this list):
1. How trees are built: Each tree in the random forest is built independently, whereas
gradient boosting builds only one tree at a time.
2. Combining results: random forests combine results at the end of the process by
averaging. Whereas gradient boosting combines results along the path.
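A minimal sketch comparing the two ensembles with scikit-learn (assumed available); the synthetic data and hyperparameters are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0)        # independent trees, averaged
gbm = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                 random_state=0)                      # trees built sequentially

print("Random forest CV accuracy:    ", cross_val_score(rf, X, y, cv=5).mean())
print("Gradient boosting CV accuracy:", cross_val_score(gbm, X, y, cv=5).mean())
```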
Eigenvectors are unit vectors, meaning their length or magnitude is equal to 1.0. They are often referred to as right vectors, which simply means column vectors.
Eigenvalues are coefficients that are applied to eigenvectors that, in turn, give the vectors their
length or magnitude.
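A minimal NumPy sketch of an eigen decomposition; the 2x2 matrix is made up for the example.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

values, vectors = np.linalg.eig(A)
print("Eigenvalues:", values)        # 3 and 1 for this matrix
print("Eigenvectors (columns):")
print(vectors)                       # each column is a unit-length vector

# Check the defining relation A v = lambda v for the first pair.
print(np.allclose(A @ vectors[:, 0], values[0] * vectors[:, 0]))
```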
Association rule mining (ARM) aims to find the association rules that satisfy the predefined minimum support and confidence from a database. ARM is mainly used to reduce the number of association rules with new fitness functions that can incorporate frequent rules.
Marginalization is a method that requires the summing of the possible values of one variable to
determine the marginal contribution of another variable.
P(X = x) = Σ_Y P(X = x, Y)
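A minimal NumPy sketch of marginalizing a small joint distribution table; the probabilities are made up for illustration.

```python
import numpy as np

# Rows index X, columns index Y: entry [i, j] = P(X = x_i, Y = y_j).
joint = np.array([[0.10, 0.20],
                  [0.30, 0.40]])

# P(X = x_i) = sum over Y of P(X = x_i, Y)
p_x = joint.sum(axis=1)
print("P(X):", p_x)    # [0.3, 0.7]
```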
Cluster sampling is defined as a type of sampling method. With cluster sampling, the
researchers usually divide the population into separate groups or sets, known as clusters. Then,
a random sample of clusters is picked from the population. Then the researcher conducts their
analysis on the data from the collected sampled clusters.
The curse of dimensionality basically refers to the increase in error with the increase in the number of features. It refers to the fact that algorithms are hard to design in high dimensions and often have a running time exponential in the number of dimensions.
48. Can You Name A Few Libraries In Python Used For Data Analysis And Scientific
Computations?
1. NumPy
2. SciPy
3. Pandas
4. SciKit
5. Matplotlib
6. Seaborn
7. Bokeh
49. What Are Outliers? Mention The Methods To Deal With Outliers?
An outlier can be defined as an object that deviates significantly from other objects. They can be
caused by execution errors.
The three main methods to deal with outliers are listed below, followed by a short sketch of the univariate (z-score) approach:
1. Univariate method
2. Multivariate method
3. Minkowski error
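A minimal sketch of the univariate method using z-scores; the data and the cutoff of 2.5 are illustrative assumptions.

```python
import numpy as np

data = np.array([10.0, 11.0, 9.5, 10.2, 10.8, 9.9, 10.1, 55.0])

# Flag points whose z-score is unusually large in absolute value.
z_scores = (data - data.mean()) / data.std()
outliers = data[np.abs(z_scores) > 2.5]
print("Outliers:", outliers)    # flags the value 55.0
```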
50. List Some Popular Distribution Curves Along With Scenarios Where You Will Use Them In
An Algorithm?
51. Can You List The Assumptions For Data To Be Met Before Starting With Linear Regression?
Variance inflation factor (VIF) is defined as a measure of the amount of multicollinearity in a given set of multiple regression variables.
Mathematically, the variance inflation factor for a regression model variable is equal to the ratio of the overall model variance to the variance of a model that includes only that single independent variable.
This ratio is calculated for each of the independent variables. A high VIF indicates that the associated independent variable is highly collinear with the other variables in the model.
53. Can You Tell Us When The Linear Regression Line Stops Rotating Or Finds An Optimal
Spot Where It Is Fitted On Data?
The place where the highest R-squared value is found is where the line comes to rest. R-squared represents the amount of variance captured by the virtual linear regression line with respect to the total variance captured by the dataset.
54. Can You Tell Us Which Machine Learning Algorithm Is Known As The Lazy Learner And
Why It Is Called So?
KNN Machine Learning algorithm is called a lazy learner. K-NN is defined as a lazy learner
because it will not learn any machine-learned values or variables from the given training data,
but dynamically it calculates the distance every time it wants to classify. Hence it memorizes the
training dataset instead.
55. Can You Tell Us What Could Be The Problem When The Beta Value For A Specific Variable
Varies Too Much In Each Subset When Regression Is Run On Various Subsets Of The Dataset?
The variations in the beta values in every subset suggest that the dataset is heterogeneous. To
overcome this problem, we use a different model for each of the clustered subsets of the given
dataset, or we use a non-parametric model like decision trees.
If the training set is small, high-bias/low-variance models, for example Naive Bayes, tend to perform better because they are less likely to overfit.
If the training set is large, low-bias/high-variance models, for example Logistic Regression, tend to perform better because they can reflect more complicated relationships.
57. Differentiate Between Training Set And Test Set In A Machine Learning Model?
Training set: Typically, around 70% of the total data is taken as the training dataset.
Test set: The remaining 30% is taken as the testing dataset.
58. Explain A False Positive And False Negative And How Are They Significant?
A false positive is a concept where you receive a positive result for a given test when you
should have actually received a negative result. It’s also called a “false alarm” or “false positive
error.” It is basically used in the medical field, but it can also apply to software testing.
Examples of False positive:
1. A pregnancy test is positive, where in fact, you are not pregnant.
2. A cancer screening test is positive, but you do not have the disease.
3. Prenatal tests are positive for Down’s Syndrome when your fetus does not have any
disorder.
4. Virus software on your system incorrectly identifies a harmless program as the malicious
one.
A false negative is defined where a negative test result is wrong. In simple words, you get a
negative test result, where you should have got a positive test result.
For example, consider taking a pregnancy test, and you test as negative (not pregnant). But in
fact, you are pregnant.
A false negative pregnancy test can result from taking the test too early, using diluted urine, or checking the results too soon. Just about every medical test carries the risk of a false negative.
60. Can You Tell Us The Applications Of Supervised Machine Learning In Modern Businesses?
1. Healthcare Diagnosis
2. Fraud detection
3. Email spam detection
4. Sentiment analysis
61. Can You Differentiate Between Inductive Machine Learning And Deductive Machine
Learning?
Inductive Machine Learning: It observes and learns from a set of instances and then draws the conclusion.
Deductive Machine Learning: It derives the conclusion first and then works on it based on the previous decision.
Bias can be defined as the assumptions made by the model to make the target function easy to
approximate.
Variance is defined as the amount that the estimate of the target function will change given the
different training data.
The trade-off is defined as the tension between the error introduced by bias and variance.
Pruning is a data compression technique in machine learning and search algorithms that reduces the size of decision trees by removing sections of the tree that are non-critical and unnecessary for classifying instances. A tree that is too large risks overfitting the training data and generalizing poorly to new samples.
Pruning can take place in the following ways:
1. Top-down fashion (it will traverse the nodes and trim subtrees starting at the root)
2. Bottom-up fashion (it will start at the leaf nodes)
Reduced error pruning is one of the algorithms used for pruning decision trees.
65. How Does The Reduced Error Algorithm Work For Pruning In Decision Trees?
In reduced error pruning, starting at the leaves, each node is replaced with its most popular class; if the prediction accuracy on a validation set is not affected, the change is kept.
A decision tree builds classification models as a tree structure, with datasets broken up into smaller subsets while developing the decision tree; basically, it is a tree-like structure with branches and nodes. Decision trees can handle both categorical and numerical data.
67. Explain Logistic Regression?
Logistic regression is a supervised classification algorithm that models the probability of a binary outcome by passing a linear combination of the input features through the sigmoid (logistic) function.
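A minimal sketch of logistic regression with scikit-learn (assumed available); the synthetic data is for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))
# predict_proba returns the estimated class probabilities from the sigmoid.
print(clf.predict_proba(X_test[:3]))
```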
Recommendation systems mainly collect the customer data and auto analyze this data to
generate the customized recommendations for the customers. These systems mainly rely on
implicit data like browsing history and recent purchases and explicit data like ratings provided by
the customer.
K-Nearest Neighbour is the simplest Machine Learning algorithm that is based on the
Supervised Learning technique. It assumes the similarity between the new case or data and the
available cases, and it puts the new case into a category that is similar to that of the available
categories.
For example, suppose we have an image of a creature that looks similar to both a cat and a dog, and we want to know whether it is a cat or a dog. For this identification, we can make use of the KNN algorithm, as it works on a similarity basis. The KNN model will find the similarities of the new image to the cat and dog images and, based on the most similar features, put it in either the cat or the dog category.
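A minimal sketch of the cat/dog idea with scikit-learn's KNeighborsClassifier (assumed available); the two features and all values are made up for illustration.

```python
from sklearn.neighbors import KNeighborsClassifier

# Two made-up features per animal, e.g. ear length and snout length.
X_train = [[3.0, 2.0], [3.2, 2.1], [2.9, 1.9],   # cats
           [6.0, 8.0], [6.5, 7.5], [5.8, 8.2]]   # dogs
y_train = ["cat", "cat", "cat", "dog", "dog", "dog"]

knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print(knn.predict([[3.1, 2.2]]))   # closest to the cat examples -> 'cat'
```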
71. Considering A Long List Of Machine Learning Algorithms And A Given Data Set, How Do You Decide Which One To Use? (Consider An Email Spam Filter As An Example.)
1. The spam filter of the email will be fed with hundreds of emails.
2. Each of these emails has a label: ‘spam’ or ‘not spam.’
3. The supervised machine learning algorithm will then identify which type of emails are
being marked as spam based on spam keywords like the lottery, no money, full refund,
etc.
4. The next time an email hits the inbox, the spam filter will use statistical analysis and
algorithms like Decision Trees and SVM to identify how likely the email is spam.
5. If the probability is high, then it will be labeled as spam, and the email will not hit your
inbox.
6. Based on the accuracy of each of the models, we use the algorithm with the highest
reliability after testing all the given models.
Selection bias takes place if a data set's examples are chosen in a way that is not reflective of their real-world distribution. Selection bias can take many different forms.
1. Coverage bias: Data here is not selected in a representative manner.
Example: A model is trained in such a way to predict the future sales of a new product based on
the phone surveys conducted with the sample of customers who bought the product.
Consumers who instead opted for buying a competing product were not surveyed, and as a
result, this set of people were not represented in the training data.
2. Non-response bias: Data here ends up being unrepresentative due to the participation
gaps in the collection of data processes.
Example: A model is trained in such a way to predict the future sales of a new product based
on the phone surveys conducted with a sample of customers who bought the product and with a
sample of customers who bought the competing product. Customers who bought the competing
product were 80% more likely to refuse to complete the survey, and their data were underrepresented in the sample.
3. Sampling bias: Here, proper randomization is not used during the data collection
process.
Example: A model that is trained to predict the future sales of a new product based on the
phone surveys conducted with a sample of customers who bought the product and with a
sample of customers who bought a competing product. Instead of randomly targeting
customers, the surveyor chose the first 200 consumers that responded to their email, who might
have been more eager about the product than the average purchasers.
In Machine Learning, we encounter the Vanishing Gradient Problem while training the Neural
Networks with gradient-based methods like Back Propagation. This problem makes it hard to
tune and learn the parameters of the earlier layers in the given network.
The vanishing gradients problem can be taken as one example of the unstable behavior that we
may encounter when training the deep neural network.
It describes a situation where the deep multilayer feed-forward network or the recurrent neural
network is not able to propagate the useful gradient information from the given output end of the
model back to the layers close to the input end of the model.
77. Can You Name The Proposed Methods To Overcome The Vanishing Gradient Problem?
Data Mining: It extracts useful information from a large amount of data. It is used to understand the flow of data. It works with huge databases of unstructured data. It requires human interference. Models are developed using data mining techniques. It is more of a research activity that uses methods like machine learning.
Machine Learning: It introduces algorithms that learn from data as well as from past experience. It teaches computers to learn and understand from the data flow. It works with existing data as well as algorithms. No human effort is required after the design. Machine learning algorithms can be used in decision trees, neural networks, and some other parts of artificial intelligence. It is self-learned and trains the system to perform intelligent tasks.
Genetic algorithms are defined as stochastic search algorithms that act on a population of possible solutions. Genetic algorithms are mainly used in artificial intelligence to search a space of potential solutions to find one that solves the problem.
83. Can You Name The Area Where Pattern Recognition Can Be Used?
1. Speech Recognition
2. Statistics
3. Information Retrieval
4. Bioinformatics
5. Data Mining
6. Computer Vision
84. Explain The Term Perceptron In Machine Learning?
A perceptron is the simplest type of artificial neuron: a supervised learning algorithm for binary classification that computes a weighted sum of its inputs and applies a threshold (step) function to decide the output class.
Isotonic regression is used iteratively to fit ideal distances to protect the relative dissimilarity
order. Isotonic regression is also used in the probabilistic classification to balance the predicted
probabilities of the supervised machine learning models.
A Bayesian network can be defined as a probabilistic graphical model that presents a set of
variables and their conditional dependencies through a DAG (directed acyclic graph).
For example, a Bayesian network would represent the probabilistic relationships between the
diseases and their symptoms. Given the specific symptoms, the network can be used to
compute the possibilities of the presence of different diseases.
87. Can You Explain The Two Components Of The Bayesian Logic Program?
A Bayesian logic program consists of two components: the logical component, a set of Bayesian clauses that captures the qualitative structure of the domain, and the quantitative component, which encodes the quantitative information through conditional probability distributions and combining rules.
The incremental learning method is defined as the ability of an algorithm to learn from new data
that is available after the classifier has already been generated from the already available
dataset.
90. Can You Explain The Bias-Variance Decomposition Of Classification Error In The Ensemble
Method?
The expected error of the learning algorithm can be divided into bias and variance. A bias term
is a measure of how closely the average classifier produced by the learning algorithm matches
with the target function. The variance term is a measure of how much the learning algorithm’s
prediction fluctuates for various training sets.
The different methods for sequential supervised learning are given below:
1. Recurrent sliding windows
2. Hidden Markov models
3. Maximum entropy Markov models
4. Conditional random fields
5. Graph transformer networks
6. Sliding-window methods
A training dataset is divided into one or more batches. When all the training samples are used in
the creation of one batch, then that learning algorithm is known as batch gradient descent.
When the given batch is the size of one sample, then the learning algorithm is called stochastic
gradient descent.
93. Can You Name The Areas In Robotics And Information Processing Where Sequential
Prediction Problem Arises?
The areas in robotics and information processing where sequential prediction problem arises
are given below
1. Structured prediction
2. Model-based reinforcement learning
3. Imitation Learning
94. Name The Different Categories You Can Categorize The Sequence Learning Process?
The different categories where you can categorize the sequence learning process are listed
below:
1. Sequence generation
2. Sequence recognition
3. Sequential decision
4. Sequence prediction
Sequence prediction aims to predict elements of the sequence on the basis of the preceding
elements.
A prediction model is trained with the set of training sequences. On training, the model is used
to perform sequence predictions. A prediction comprises predicting the next items of a
sequence. This task has a number of applications like web page prefetching, weather
forecasting, consumer product recommendation, and stock market prediction.
Examples of sequence prediction problems include:
1. Weather Forecasting. Given a sequence of observations about the weather over a period of time, it predicts the expected weather for tomorrow.
2. Stock Market Prediction. Given a sequence of movements of the security over a period
of time, it predicts the next movement of the security.
3. Product Recommendation. Given a sequence of the last purchases of a customer, it
predicts the next purchase of a customer.
Probably approximately correct, i.e., PAC learning is defined as a theoretical framework used for
analyzing the generalization error of the learning algorithm in terms of its error on a given
training set and some measures of the complexity. The main goal here is to typically show that
an algorithm can achieve low generalization error with high probability.
97. What Are PCA, KPCA, And ICA, And What Are They Used For?
Principal Components Analysis(PCA): It linearly transforms the original inputs into the new
uncorrelated features.
Kernel-based Principal Component Analysis (KPCA): It is a nonlinear form of PCA developed by using the kernel method.
Independent Component Analysis(ICA): In ICA, the original inputs are linearly transformed into
certain features that are mutually statistically independent.
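A minimal sketch of all three transforms with scikit-learn (assumed available); the synthetic data and the choice of two components are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA, KernelPCA, FastICA

X, _ = make_classification(n_samples=200, n_features=10, random_state=0)

pca = PCA(n_components=2).fit_transform(X)                       # linear, uncorrelated features
kpca = KernelPCA(n_components=2, kernel="rbf").fit_transform(X)  # nonlinear via the kernel trick
ica = FastICA(n_components=2, random_state=0).fit_transform(X)   # statistically independent features

print(pca.shape, kpca.shape, ica.shape)   # each: (200, 2)
```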
Machine Learning, especially supervised learning, can be specified as the desire to use the
available data to learn a function that best maps the inputs to outputs.
Technically, this problem is called function approximation, where we are approximating an unknown target function that we assume exists and that can best map the given inputs to outputs over all possible observations from the problem domain.
An example of the model that approximates the target function and performs the mappings of
inputs to the outputs is known as the hypothesis in machine learning.
The choice of algorithm and the configuration of the algorithm define the space of possible
hypotheses that the model may constitute.
100. Explain The Terms Epoch, Entropy, Bias, And Variance In Machine Learning?
Epoch is a term widely used in machine learning that indicates the number of passes of the
whole training dataset that the machine learning algorithm has completed. If the batch size is
the entire training dataset, then the number of epochs is defined as the number of iterations.
Entropy in Machine learning can be defined as the measure of disorder or uncertainty. The main
goal of machine learning models and Data Scientists, in general, is to decrease uncertainty.
Data bias is a type of error in which certain elements of a dataset are more heavily weighted
than others.
Variance is defined as the amount that the estimate of the target function will change if a
different training data set was used. The target function is usually estimated from the training
data by the machine learning algorithm.