Machine Learning Interview Questions
Artificial Intelligence (AI) is the domain of producing intelligent machines. Machine Learning (ML) refers to systems that learn from experience (training data), and Deep Learning (DL) refers to systems that learn from experience on large data sets. ML can be considered a subset of AI, and DL a subset of ML that is suited to large data sets. The figure below roughly encapsulates the relation between AI, ML, and DL:
Additional Information: ASR (Automatic Speech Recognition) and NLP (Natural Language Processing) fall under AI and overlap with ML and DL, as ML is often utilized for NLP and ASR tasks.
C. Reinforcement Learning:
The model learns through trial and error. This kind of learning involves an agent that interacts with the environment, takes actions, and then discovers the errors or rewards of those actions.
So, there is no single metric to decide which algorithm should be used for a given situation or data set. We need to explore the data using EDA (Exploratory Data Analysis) and understand the purpose of using the dataset to come up with the best-fit algorithm. So, it is important to study all the algorithms in detail.
9. We look at machine learning software almost all the time. How do we apply Machine Learning
to Hardware?
We have to build ML algorithms in SystemVerilog, which is a hardware description language, and then program them onto an FPGA to apply machine learning to hardware.
10. Explain One-hot encoding and Label Encoding. How do they affect the dimensionality of the
given dataset?
One-hot encoding is the representation of categorical variables as binary vectors. Label encoding converts labels/words into numeric form. Using one-hot encoding increases the dimensionality of the data set, while label encoding does not: one-hot encoding creates a new binary variable for each level of the categorical variable, whereas in label encoding the levels of a variable get encoded as integers (0, 1, 2, ...) in a single column.
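A minimal sketch of the two encodings (assuming pandas and scikit-learn are available; the colour column is made up):
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical data set with one categorical column of three levels
df = pd.DataFrame({"colour": ["red", "green", "blue", "green"]})

# One-hot encoding: one new binary column per level, so dimensionality grows
one_hot = pd.get_dummies(df, columns=["colour"])
print(one_hot.shape)   # (4, 3) - one column per level

# Label encoding: levels become integers in a single column, dimensionality unchanged
df["colour_label"] = LabelEncoder().fit_transform(df["colour"])
print(df["colour_label"].tolist())   # [2, 1, 0, 1]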
Variance is also an error, caused by too much complexity in the learning algorithm. This makes the algorithm highly sensitive to variation in the training data, which can lead your model to overfit the data: it carries too much noise from the training data to be very useful on your test data.
The bias-variance decomposition essentially decomposes the learning error from any algorithm
by adding the bias, the variance and a bit of irreducible error due to noise in the underlying
dataset. Essentially, if you make the model more complex and add more variables, you’ll lose
bias but gain some variance — in order to get the optimally reduced amount of error, you’ll have
to trade off bias and variance. You don’t want either high bias or high variance in your model.
14. A data set is given to you and it has missing values which spread along 1 standard deviation
from the mean. How much of the data would remain untouched?
It is given that the missing values are spread within 1 standard deviation of the mean, so we can presume that the data follows a normal distribution. In a normal distribution, about 68% of the data lies within 1 standard deviation of the mean. That means about 32% of the data remains uninfluenced by missing values.
16. If your dataset is suffering from high variance, how would you handle it?
For datasets with high variance, we could use a bagging algorithm. Bagging splits the data into subsets by sampling with replacement from the original data, trains a model on each subset using a training algorithm, and then combines all the predicted outcomes by voting (or averaging), as in the sketch below.
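A minimal sketch of bagging with scikit-learn's BaggingClassifier (synthetic data; the default base estimator is a decision tree):
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each of the 50 estimators is trained on a bootstrap sample (sampling with
# replacement); their predictions are combined by voting.
bag = BaggingClassifier(n_estimators=50, random_state=0)
bag.fit(X_train, y_train)
print(bag.score(X_test, y_test))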
17. A data set is given to you about utilities fraud detection. You have built a classifier model and achieved a performance score of 98.5%. Is this a good model? If yes, justify. If not, what can you do about it?
A data set about utilities fraud detection is typically not balanced, i.e. it is imbalanced. In such a data set, the accuracy score cannot be the measure of performance, as the model may only predict the majority class label correctly, while our point of interest is predicting the minority label. The minority class is often treated as noise and ignored, so there is a high probability of misclassifying the minority label compared to the majority label. For evaluating model performance on imbalanced data sets, we should use Sensitivity (True Positive Rate) or Specificity (True Negative Rate) to determine the class-wise performance of the classification model. If the minority class label's performance is not good, we could do the following:
Identifying missing values and dropping the rows or columns can be done using the isnull() and dropna() functions in Pandas. The fillna() function in Pandas replaces missing values with a placeholder value.
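A minimal Pandas sketch (the data frame is made up):
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 32], "salary": [50000, 60000, np.nan]})

print(df.isnull().sum())        # number of missing values per column
dropped = df.dropna()           # drop rows that contain any missing value
filled = df.fillna(df.mean())   # replace missing values with a placeholder (here, the column mean)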
21. What is the difference between stochastic gradient descent (SGD) and gradient descent
(GD)?
Gradient Descent and Stochastic Gradient Descent are algorithms that find the set of parameters that minimizes a loss function.
The difference is that in Gradient Descent, all training samples are evaluated for each parameter update, while in Stochastic Gradient Descent only one training sample is evaluated for each parameter update.
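A toy NumPy sketch fitting a single slope parameter, to show how each update is computed:
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=100)
y = 3.0 * X + rng.normal(scale=0.1, size=100)   # true slope is 3

w_gd = w_sgd = 0.0
lr = 0.01
for epoch in range(100):
    # Gradient Descent: gradient of the squared error over ALL training samples
    w_gd -= lr * np.mean(2 * (w_gd * X - y) * X)

    # Stochastic Gradient Descent: gradient from a SINGLE randomly chosen sample
    i = rng.integers(len(X))
    w_sgd -= lr * 2 * (w_sgd * X[i] - y[i]) * X[i]

print(w_gd, w_sgd)   # both move toward the true slope of 3, SGD more noisily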
22. What is the exploding gradient problem while using the back propagation technique?
When large error gradients accumulate and result in large changes in the neural network weights during training, it is called the exploding gradient problem. The values of the weights can become so large as to overflow and result in NaN values. This makes the model unstable and causes learning to stall, much like the vanishing gradient problem. This is one of the most commonly asked interview questions on machine learning.
23. Can you mention some advantages and disadvantages of decision trees?
The advantages of decision trees are that they are easier to interpret, are nonparametric and
hence robust to outliers, and have relatively few parameters to tune.
On the other hand, the disadvantage is that they are prone to overfitting.
24. Explain the differences between Random Forest and Gradient Boosting machines.
Random Forests vs Gradient Boosting
Random forests pool a large number of decision trees using averaging or majority voting at the end of the process, whereas gradient boosting machines combine decision trees sequentially as training proceeds rather than at the end.
The random forest builds each tree independently of the others, while gradient boosting develops one tree at a time. Gradient boosting yields better outcomes than random forests if parameters are carefully tuned, but it is not a good option if the data set contains a lot of outliers/anomalies/noise, as this can result in overfitting of the model.
Random forests perform well for multiclass object detection, while gradient boosting performs well on imbalanced data, such as in real-time risk assessment.
25. What is a confusion matrix and why do you need it?
A confusion matrix (also called the error matrix) is a table that is frequently used to illustrate the performance of a classification model, i.e. a classifier, on a set of test data for which the true values are known.
P(X=x) = Σ_y P(X=x, Y=y)
Given the joint probability P(X=x, Y), we can use marginalization to find P(X=x). So, marginalization finds the distribution of one random variable by summing over all the cases of the other random variables.
The phrase is used to express the difficulty of using brute force or grid search to optimize a
function with too many inputs.
If we have more features than observations, we have a risk of overfitting the model.
When we have too many features, observations become harder to cluster. Too many
dimensions cause every observation in the dataset to appear equidistant from all others and no
meaningful clusters can be formed.
Dimensionality reduction techniques like PCA come to the rescue in such cases.
30. What is Principal Component Analysis?
The idea here is to reduce the dimensionality of the data set by reducing the number of variables that are correlated with each other, while retaining the variation in the data to the maximum extent.
The variables are transformed into a new set of variables known as Principal Components. These PCs are the eigenvectors of the covariance matrix and are therefore orthogonal.
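A minimal scikit-learn sketch on the Iris data (four correlated features reduced to two principal components):
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)              # keep the two orthogonal directions of largest variance
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape)                 # (150, 2)
print(pca.explained_variance_ratio_)   # fraction of variance retained by each component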
32. What are outliers? Mention three methods to deal with outliers.
A data point that is considerably distant from the other similar data points is known as an outlier.
They may occur due to experimental errors or variability in measurement. They are problematic
and can mislead a training process, which eventually results in longer training time, inaccurate
models, and poor results.
Normalization vs Standardization
Normalization refers to re-scaling the values to fit into a range of [0, 1]. It is useful when all parameters need to have an identical positive scale; however, the outliers in the data set are lost.
Standardization refers to re-scaling data to have a mean of 0 and a standard deviation of 1 (unit variance).
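A minimal scikit-learn sketch (the single-feature array is made up):
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[1.0], [5.0], [10.0], [50.0]])

normalized = MinMaxScaler().fit_transform(x)      # values re-scaled into [0, 1]
standardized = StandardScaler().fit_transform(x)  # mean 0, standard deviation 1

print(normalized.ravel())
print(standardized.mean(), standardized.std())    # approximately 0 and 1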
35. List the most popular distribution curves along with scenarios where you will use them in an
algorithm.
The most popular distribution curves are as follows- Bernoulli Distribution, Uniform Distribution,
Binomial Distribution, Normal Distribution, Poisson Distribution, and Exponential Distribution.
Check out the free Probability for Machine Learning course to enhance your knowledge on
Probability Distributions for Machine Learning.
Each of these distribution curves is used in various scenarios.
Bernoulli Distribution can be used to check if a team will win a championship or not, a newborn
child is either male or female, you either pass an exam or not, etc.
Uniform distribution is a probability distribution that has a constant probability. Rolling a single die is one example, because it has a fixed number of equally likely outcomes.
Binomial distribution is a probability distribution in which each trial has only two possible outcomes; the prefix 'bi' means two or twice. An example of this would be a coin toss, where the outcome will either be heads or tails.
Normal distribution describes how the values of a variable are distributed. It is typically a
symmetric distribution where most of the observations cluster around the central peak. The
values further away from the mean taper off equally in both directions. An example would be the
height of students in a classroom.
Poisson distribution helps predict the probability of certain events happening when you know
how often that event has occurred. It can be used by businessmen to make forecasts about the
number of customers on certain days and allows them to adjust supply according to the
demand.
Exponential distribution is concerned with the amount of time until a specific event occurs. For
example, how long a car battery would last, in months.
Shapiro-Wilk W Test
Anderson-Darling Test
Martinez-Iglewicz Test
Kolmogorov-Smirnov Test
D’Agostino Skewness Test
37. What is Linear Regression?
A linear function can be defined as a mathematical function on a 2D plane as Y = Mx + C, where Y is the dependent variable, X is the independent variable, C is the intercept, and M is the slope. The same can be expressed as Y being a function of X, or Y = F(X).
At any given value of X, one can compute the value of Y using the equation of the line. This relation between Y and X, with the degree of the polynomial being 1, is called linear regression.
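A minimal scikit-learn sketch (the data is synthetic, generated from Y = 2X + 1 plus noise):
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2 * X.ravel() + 1 + rng.normal(scale=0.5, size=100)

model = LinearRegression().fit(X, y)
print(model.coef_[0], model.intercept_)   # estimated slope M and intercept C
print(model.predict([[4.0]]))             # value of Y computed at X = 4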
39. What is target imbalance? How do we fix it? Describe a scenario where you have corrected target imbalance in data. Which metrics and algorithms do you find suitable for such data?
If you have a categorical variable as the target, and a grouping or frequency count on it shows that certain categories are far more numerous than others, this is known as target imbalance.
Example: Target column – 0,0,0,1,0,2,0,0,1,1 [0s: 60%, 1s: 30%, 2s: 10%]; 0s are in the majority. To fix this, we can perform up-sampling or down-sampling. Before fixing this problem, let's assume that the performance metric used was the confusion matrix. After fixing this problem, we can shift to AUC-ROC as the metric. Since we added/deleted data [up-sampling or down-sampling], we can go ahead with a stricter algorithm like SVM, Gradient Boosting or AdaBoost.
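A minimal up-sampling sketch with sklearn.utils.resample (the toy data frame and column names are made up):
import pandas as pd
from sklearn.utils import resample

# Hypothetical imbalanced target: eight 0s and two 1s
df = pd.DataFrame({"feature": range(10),
                   "target": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]})

majority = df[df["target"] == 0]
minority = df[df["target"] == 1]

# Up-sample the minority class (sampling with replacement) to match the majority
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=0)
balanced = pd.concat([majority, minority_up])
print(balanced["target"].value_counts())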
40. List all assumptions for data to be met before starting with linear regression.
Before starting linear regression, the assumptions to be met are as follows:
Linear relationship
Multivariate normality
No or little multicollinearity
No auto-correlation
Homoscedasticity
41. When does the linear regression line stop rotating or finds an optimal spot where it is fitted
on data?
The place where the highest R-squared value is found is where the line comes to rest. R-squared represents the amount of variance captured by the fitted regression line relative to the total variance in the dataset.
42. Why is logistic regression a type of classification technique and not a regression? Name the
function it is derived from?
Since the target column is categorical, logistic regression uses a linear combination of the inputs to model the log of the odds, which is then wrapped with the logistic (sigmoid) function so that the regression output can be used as a classifier. Hence, it is a type of classification technique and not a regression. It is derived from the logistic (sigmoid) function.
43. What could be the issue when the beta value for a certain variable varies way too much in
each subset when regression is run on different subsets of the given dataset?
Variations in the beta values across subsets imply that the dataset is heterogeneous. To overcome this problem, we can use a different model for each of the dataset's clustered subsets, or a non-parametric model such as decision trees.
45. Which machine learning algorithm is known as the lazy learner, and why is it called so?
KNN (k-Nearest Neighbours) is a machine learning algorithm known as a lazy learner. K-NN is a lazy learner because it doesn't learn any model parameters from the training data; instead, it memorizes the training dataset and dynamically calculates distances every time it needs to classify a new point.
49. What are Kernels in SVM? List popular kernels used in SVM along with a scenario of their
applications.
The function of the kernel is to take data as input and transform it into the required form. A few
popular Kernels used in SVM are as follows: RBF, Linear, Sigmoid, Polynomial, Hyperbolic,
Laplace, etc.
51. What are ensemble models? Explain how ensemble techniques yield better learning as
compared to traditional classification ML algorithms.
An ensemble is a group of models that are used together for prediction both in classification and
regression classes. Ensemble learning helps improve ML results because it combines several
models. By doing so, it allows for a better predictive performance compared to a single model.
They are superior to individual models as they reduce variance, average out biases, and have
lesser chances of overfitting.
52. What are overfitting and underfitting? Why does the decision tree algorithm suffer often with
overfitting problems?
Overfitting occurs when a statistical model or machine learning algorithm captures the noise of the data. Underfitting occurs when a model or machine learning algorithm does not fit the data well enough, and happens when the model shows low variance but high bias.
In decision trees, overfitting occurs when the tree is designed to fit all samples in the training data set perfectly. This results in branches with strict rules on sparse data and affects the accuracy when predicting samples that aren't part of the training set.
54. Why is boosting a more stable algorithm compared to other ensemble algorithms?
Boosting focuses on the errors found in previous iterations until they become negligible, whereas in bagging there is no such corrective loop. This is why boosting is considered a more stable algorithm compared to other ensemble algorithms.
K fold
Stratified k fold
Leave one out
Bootstrapping
Random search cv
Grid search cv
57. Is it possible to test for the probability of improving model accuracy without cross-validation
techniques? If yes, please explain.
Yes, it is possible to test for the probability of improving model accuracy without cross-validation techniques. We can do so by running the ML model for, say, n iterations and recording the accuracy of each run. Plot all the accuracies, remove the 5% of low-probability values, and measure the left [low] cut-off and right [high] cut-off. With the remaining 95% confidence, we can say that the model's accuracy can go as low or as high as the values within the cut-off points.
60. List all types of popular recommendation systems? Name and explain two personalized
recommendation systems along with their ease of implementation.
Popularity-based recommendation, content-based recommendation, user-based collaborative filtering, and item-based recommendation are the popular types of recommendation systems.
Personalized recommendation systems are content-based recommendation, user-based collaborative filtering, and item-based recommendation; of these, user-based collaborative filtering and item-based recommendation are the more personalized. They are also easy to maintain: the similarity matrix can be maintained easily with item-based recommendation.
61. How do we deal with sparsity issues in recommendation systems? How do we measure its
effectiveness? Explain.
Singular value decomposition can be used to generate the prediction matrix. RMSE is the
measure that helps us understand how close the prediction matrix is to the original matrix.
62. Name and define techniques used to find similarities in the recommendation system.
Pearson correlation and cosine similarity are techniques used to find similarities in recommendation systems.
Non-linear transformations cannot remove overlap between two classes, but they can increase it.
Often it is not clear which basis functions are the best fit for a given task, so learning the basis functions can be more useful than using fixed basis functions.
If we want to use only fixed ones, we can use a lot of them and let the model figure out the best fit, but that would lead to overfitting the model, thereby making it unstable.
64. Define and explain the concept of Inductive Bias with some examples.
Inductive bias is a set of assumptions that we use to predict outputs for inputs that the learning algorithm has not encountered yet. When we are trying to learn Y from X and the hypothesis space for Y is infinite, we need to reduce the scope using our beliefs/assumptions about the hypothesis space; these assumptions are the inductive bias. Through them, we constrain our hypothesis space and also gain the capability to incrementally test and improve on the data using hyper-parameters. Examples:
66. Keeping train and test split criteria in mind, is it good to perform scaling before the split or
after the split?
Scaling should ideally be done after the train and test split: the scaler is fitted on the training set and then applied to the test set. If the data is closely packed, then scaling post- or pre-split should not make much difference.
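A minimal sketch of the recommended order (fit the scaler on the training split only):
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics learned from the training set only
X_test_scaled = scaler.transform(X_test)        # test set scaled with the same statistics, avoiding leakage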
True Positives (TP) – These are the correctly predicted positive values. It implies that the value
of the actual class is yes and the value of the predicted class is also yes.
True Negatives (TN) – These are the correctly predicted negative values. It implies that the
value of the actual class is no and the value of the predicted class is also no.
False positives and false negatives occur when the predicted class contradicts the actual class.
Now,
Recall, also known as Sensitivity, is the ratio of true positives (TP) to all observations that are actually positive:
Recall = TP/(TP+FN)
Precision, the positive predictive value, is the proportion of predicted positives that are actually positive:
Precision = TP/(TP+FP)
Accuracy is the most intuitive performance measure: it is simply the ratio of correctly predicted observations to the total observations.
Accuracy = (TP+TN)/(TP+FP+FN+TN)
F1 Score is the harmonic mean of Precision and Recall: F1 = 2 * (Precision * Recall) / (Precision + Recall). Therefore, this score takes both false positives and false negatives into account. Intuitively it is not as easy to understand as accuracy, but F1 is usually more useful than accuracy, especially if you have an uneven class distribution.
Accuracy works best if false positives and false negatives have a similar cost. If the cost of false
positives and false negatives are very different, it’s better to look at both Precision and Recall.
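A minimal scikit-learn sketch of these metrics on made-up labels:
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             f1_score, precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, tn, fp, fn)                    # 4, 4, 1, 1
print(recall_score(y_true, y_pred))      # TP / (TP + FN) = 0.8
print(precision_score(y_true, y_pred))   # TP / (TP + FP) = 0.8
print(accuracy_score(y_true, y_pred))    # (TP + TN) / total = 0.8
print(f1_score(y_true, y_pred))          # harmonic mean of precision and recall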
68. Plot validation score and training score with data set size on the x-axis and another plot with
model complexity on the x-axis.
For high bias in the models, the performance of the model on the validation data set is similar to
the performance on the training data set. For high variance in the models, the performance of
the model on the validation set is worse than the performance on the training set.
69. What is Bayes’ Theorem? State at least 1 use case with respect to the machine learning
context?
Bayes' Theorem describes the probability of an event based on prior knowledge of conditions that might be related to the event: P(A|B) = P(B|A) P(A) / P(B). For example, if cancer is related to age, then, using Bayes' theorem, a person's age can be used to more accurately assess the probability that they have cancer than can be done without the knowledge of the person's age.
The chain rule for Bayesian probability can be used to predict the likelihood of the next word in a sentence.
Naive Bayes is considered naive because its attributes (features) are assumed to be independent of each other given the class. This assumed lack of dependence between attributes of the same class is what gives it the quality of naiveness.
72. What do the terms prior probability and marginal likelihood in context of Naive Bayes
theorem mean?
Prior probability is the proportion of each class of the dependent variable in the data set. If you are given a data set where the dependent variable is either 1 or 0, the percentage of 1s is 65%, and the percentage of 0s is 35%, then the prior probability that any new input belongs to class 1 is 65%.
Marginal likelihood is the denominator of the Bayes equation; it ensures that the posterior probability is valid by making its total area 1.
73. Explain the difference between Lasso and Ridge?
Lasso (L1) and Ridge (L2) are regularization techniques in which we penalize the coefficients to find the optimum solution. In ridge, the penalty is defined by the sum of the squares of the coefficients, while for lasso, we penalize the sum of the absolute values of the coefficients. Another type of regularization method is ElasticNet, a hybrid penalizing function of both lasso and ridge.
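A minimal scikit-learn sketch of the three penalties on synthetic data:
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)                     # L2: sum of squared coefficients
lasso = Lasso(alpha=1.0).fit(X, y)                     # L1: sum of absolute coefficients, can zero some out
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)   # hybrid of L1 and L2

print(ridge.coef_)
print(lasso.coef_)   # typically sparser than the ridge coefficients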
76. Model accuracy or Model performance? Which one will you prefer and why?
This is a trick question; one should first get a clear idea of what model performance means. If performance means speed, then it depends upon the nature of the application: any application related to a real-time scenario will need high speed as an important feature. Example: the best search results lose their virtue if the query results do not appear fast.
If performance hints at why accuracy is not the most important virtue: for any imbalanced data set it is the F1 score, rather than accuracy, that will explain the business case, and when the data is imbalanced, precision and recall will be more important than the rest.
77. List the advantages and limitations of the Temporal Difference Learning Method.
Temporal Difference Learning Method is a mix of Monte Carlo method and Dynamic
programming method. Some of the advantages of this method include:
In under-sampling, we reduce the size of the majority class to match the minority class, which helps improve performance with respect to storage and run-time execution, but it potentially discards useful information.
In over-sampling, we up-sample the minority class and thus solve the problem of information loss; however, we get into the trouble of overfitting.
Synthetic Minority Over-sampling Technique (SMOTE) – A subset of data is taken from the
minority class as an example and then new synthetic similar instances are created which are
then added to the original dataset. This technique is good for Numerical data points.
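A minimal SMOTE sketch, assuming the imbalanced-learn package is installed:
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic imbalanced data: roughly 90% class 0 and 10% class 1
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print(Counter(y))

# SMOTE creates new synthetic minority samples by interpolating between neighbours
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(Counter(y_res))   # the classes are now balanced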
Visualization
Univariate visualization
Bivariate visualization
Multivariate visualization
Missing Value Treatment – Replace missing values with Either Mean/Median
Outlier Detection – Use Boxplot to identify the distribution of Outliers, then Apply IQR to set the
boundary for IQR
Scaling the Dataset – Apply MinMax, Standard Scaler or Z Score Scaling mechanism to scale
the data.
Feature Engineering – Domain and SME (subject-matter expert) knowledge help the analyst find derived fields which can fetch more information about the nature of the data
Dimensionality reduction — Helps in reducing the volume of data without losing much
information
80. Mention why feature engineering is important in model building and list out some of the
techniques used for feature engineering.
Algorithms necessitate features with some specific characteristics to work appropriately. The
data is initially in a raw form. You need to extract features from this data before supplying it to
the algorithm. This process is called feature engineering. When you have relevant features, the
complexity of the algorithms reduces. Then, even if a non-ideal algorithm is used, results come
out to be accurate.
Prepare the suitable input data set to be compatible with the machine learning algorithm
constraints.
Enhance the performance of machine learning models.
Some of the techniques used for feature engineering include Imputation, Binning, Outliers
Handling, Log transform, grouping operations, One-Hot encoding, Feature split, Scaling,
Extracting date.
Bootstrap Aggregation, or bagging, is a method used to reduce the variance of algorithms that have very high variance. Decision trees are a particular family of classifiers which are susceptible to having high variance.
Decision trees are very sensitive to the data they are trained on, so generalization of results is often much harder to achieve in them despite careful fine-tuning; the results vary greatly if the training data is changed.
Hence bagging is utilised: multiple decision trees are built, each trained on a sample of the original data, and the final result is the average (or majority vote) of all these individual models.
Boosting is the process of using an n-weak-classifier system for prediction such that every weak classifier compensates for the weaknesses of its predecessors. By weak classifier, we mean a classifier which performs poorly on a given data set.
It is evident that boosting is not an algorithm, rather it is a process. The weak classifiers used are generally logistic regression, shallow decision trees, etc.
There are many algorithms which make use of the boosting process, but the ones mainly used are AdaBoost, Gradient Boosting, and XGBoost.
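A minimal AdaBoost sketch with scikit-learn (its default weak learner is a depth-1 decision tree):
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each new weak learner gives more weight to the samples its predecessors misclassified
ada = AdaBoostClassifier(n_estimators=100, random_state=0)
ada.fit(X_train, y_train)
print(ada.score(X_test, y_test))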
85. What is the difference between a generative and discriminative model?
A generative model learns the distribution of each category of data. On the other hand, a discriminative model only learns the distinctions between the different categories of data. Discriminative models generally perform better than generative models on classification tasks.
86. What are hyperparameters and how are they different from parameters?
A parameter is a variable that is internal to the model and whose value is estimated from the training data. Parameters are often saved as part of the learned model; examples include weights and biases.
A hyperparameter is a variable that is external to the model and whose value cannot be estimated from the data. Hyperparameters are often used to control how the model parameters are estimated, and their choice is sensitive to the implementation. Examples include the learning rate, the number of hidden layers, etc.
When choosing a classifier, we need to consider the type of data to be classified, and this can be characterized by the VC dimension of a classifier. It is defined as the cardinality of the largest set of points that the classification algorithm, i.e. the classifier, can shatter. In order to have a VC dimension of at least n, a classifier must be able to shatter at least one configuration of n points.
88. What are some differences between a linked list and an array?
Arrays and linked lists are both used to store linear data of similar types. However, there are a few differences between them.
The meshgrid() function is used to create a rectangular grid out of 1-D arrays of x-axis inputs and y-axis inputs, representing matrix indexing. contourf() is used to draw filled contours using the given x-axis inputs, y-axis inputs, contour levels, colours, etc.
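A minimal Matplotlib sketch (the surface Z is made up):
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(-3, 3, 100)     # 1-D x-axis inputs
y = np.linspace(-3, 3, 100)     # 1-D y-axis inputs
X, Y = np.meshgrid(x, y)        # 2-D grid built from the 1-D arrays (matrix indexing)
Z = np.exp(-(X**2 + Y**2))      # some surface defined on the grid

plt.contourf(X, Y, Z, levels=20)   # filled contours of Z over the grid
plt.colorbar()
plt.show()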
We can store information on the entire network instead of storing it in a database. It has the
ability to work and give a good accuracy even with inadequate information. A neural network
has parallel processing ability and distributed memory.
Disadvantages:
Neural networks require processors that are capable of parallel processing. The unexplained functioning of the network is also quite an issue, as it reduces trust in the network in some situations, for example when we have to explain the problem we noticed to the network. The training duration of the network is mostly unknown: we can only tell that training is finished by looking at the error value, and that does not guarantee optimal results.
92. You have to train a 12GB dataset using a neural network with a machine which has only
3GB RAM. How would you go about it?
We can use NumPy memory-mapped arrays to solve this issue. NumPy can map the complete dataset on disk without loading it entirely into memory. We can then slice the array, dividing the data into batches, to get the data required, and pass those batches to the neural network. We should, however, be careful about keeping the batch size reasonable.
Code:
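A minimal sketch of the memory-mapping approach (the file name, dtype, shapes, and training call are hypothetical):
import numpy as np

# Hypothetical: the 12 GB data set stored on disk as raw float32 values,
# n_samples rows of n_features columns each
n_samples, n_features, batch_size = 3_000_000, 1000, 256

# np.memmap maps the file into memory lazily instead of loading it all into RAM
data = np.memmap("train_data.dat", dtype="float32",
                 mode="r", shape=(n_samples, n_features))

for start in range(0, n_samples, batch_size):
    batch = np.asarray(data[start:start + batch_size])   # only this slice is actually read
    # model.train_on_batch(batch, ...)                    # hypothetical training call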
Example:
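For instance (the fruit names are illustrative):
fruits = ["apple", "banana", "cherry"]
print(fruits[0])   # 'apple' - the first element sits at index 0
print(fruits[2])   # 'cherry'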
In the above case, fruits is a list that comprises three fruits. To access them individually, we use their indexes. Python and C are 0-indexed languages, that is, the first index is 0. MATLAB, on the contrary, starts from 1 and thus is a 1-indexed language.
Lists are an effective data structure provided in Python, with various functionalities associated with them. Let us consider the scenario where we want to copy a list to another list. If the same operation had to be done in the C programming language, we would have to write our own function to implement it.
On the contrary, Python provides us with a function called copy. We can copy a list to another just by calling the copy function.
new_list = old_list.copy()
We need to be careful while using the function. copy() is a shallow copy function, that is, it only stores references to the objects of the original list in the new list. If the given argument is a compound data structure like a list, Python creates another object of the same type (in this case, a new list), but for everything inside the old list, only the references are copied. Essentially, the new list consists of references to the elements of the older list.
Hence, upon changing a nested element of the original list, the corresponding value in the new list also changes. This can be dangerous in many applications. Therefore, Python provides another function called deepcopy. Intuitively, we may expect deepcopy() to follow the same paradigm, the only difference being that for each element we recursively call deepcopy. Practically, this is not the whole story.
deepcopy() preserves the graphical structure of the original compound data. Let us understand
this better with the help of an example:
from copy import deepcopy

a = [1, 2]
b = [a, a]            # there's only 1 object a, referenced twice
c = deepcopy(b)
print(c[0] is c[1])   # True - the shared structure is preserved
print(c[0] is a)      # False - a brand new object was created
Therefore, deepcopy prevents unnecessary duplicates and preserves the shared structure of the copied compound data. In this case, c[0] is c[1] (the shared reference is preserved), but c[0] is not a, as internally their addresses are different.
Normal copy
>>> a = [[1, 2, 3], [4, 5, 6]]
>>> b = list(a)
>>> a
[[1, 2, 3], [4, 5, 6]]
>>> b
[[1, 2, 3], [4, 5, 6]]
>>> a[0][1] = 10
>>> a
[[1, 10, 3], [4, 5, 6]]
>>> b # b changes too -> Not a deepcopy.
[[1, 10, 3], [4, 5, 6]]
Deep copy
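A parallel session using copy.deepcopy (mirroring the shallow-copy example above):
>>> import copy
>>> a = [[1, 2, 3], [4, 5, 6]]
>>> b = copy.deepcopy(a)
>>> a[0][1] = 10
>>> a
[[1, 10, 3], [4, 5, 6]]
>>> b # b is unchanged -> a deep copy
[[1, 2, 3], [4, 5, 6]]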
97. Given an array of integers where each element represents the max number of steps that can be made forward from that element, the task is to find the minimum number of jumps to reach the end of the array (starting from the first element). If an element is 0, then we cannot move through that element.
Solution: This is the classic minimum-jumps problem. We want to determine the minimum number of jumps required to reach the end. Each element in the array represents the maximum number of steps that can be taken forward from that position.
We need to reach the end, so let us keep a count of the jumps that tells us how near we are to the end. Consider the array A = [1, 2, 3, 1, 1].
Let us start from the end and move backwards, as that makes more sense intuitively. We will use the variables right and prev_r (denoting the previous right) to keep track of the jumps.
Initially, right = prev_r = the last index. For each index j, we check whether the number of steps possible from it reaches the current prev_r; if j + arr[j] >= prev_r, then j can serve as the new right. The leftmost such j becomes the next prev_r and the jump count is incremented. Try it out using pen and paper first; the logic will seem very straightforward to implement. Later, implement it on your own and then verify with the result below.
def min_jmp(arr):
    n = len(arr)
    right = prev_r = n - 1
    count = 0
    # We start from the rightmost index and traverse the array to find the
    # leftmost index from which we can reach index 'right'
    while True:
        for j in range(prev_r - 1, -1, -1):
            if j + arr[j] >= prev_r:
                right = j
        if prev_r != right:
            prev_r = right
        else:
            break
        count += 1
    # prev_r == 0 means the start of the array was reached
    return count if prev_r == 0 else -1

print(min_jmp([1, 2, 3, 1, 1]))   # 3
98. Given a string S consisting only ‘a’s and ‘b’s, print the last index of the ‘b’ present in it.
When we are given a string of a's and b's, we can immediately find the first location at which a character occurs. Therefore, to find the last occurrence of a character, we reverse the string and find the first occurrence, which is equivalent to the last occurrence in the original string.
Here, the input is given as a string, so we begin by splitting it into its characters using a split helper function. Then we reverse the array, find the position of the first occurrence, and get the original index as len - position - 1, where position is the index in the reversed array.
def split(word):
    return [char for char in word]

a = input()
a = split(a)
a_rev = a[::-1]
pos = -1
for i in range(len(a_rev)):
    if a_rev[i] == 'b':
        pos = len(a_rev) - i - 1
        print(pos)
        break
    else:
        continue
if pos == -1:
    print(-1)
99. Rotate the elements of an array by d positions to the left. Let us initially look at an example.
A = [1,2,3,4,5]
A <<2
[3,4,5,1,2]
A<<3
[4,5,1,2,3]
There exists a pattern here: the first d elements are interchanged with the last n-d elements. Therefore, we could just swap the two slices. Correct? But what if the size of the array is huge, say 10,000 elements? There are chances of memory errors, run-time errors, etc. Therefore, we do it more carefully: we rotate the elements one by one in order to prevent the above errors in the case of large arrays.
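A minimal sketch of the one-by-one rotation described above:
def rotate_left(arr, d):
    n = len(arr)
    for _ in range(d):            # repeat a single left rotation d times
        first = arr[0]
        for i in range(n - 1):    # shift every element one position to the left
            arr[i] = arr[i + 1]
        arr[n - 1] = first        # the old first element wraps around to the end
    return arr

print(rotate_left([1, 2, 3, 4, 5], 2))   # [3, 4, 5, 1, 2]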
Solution: We are given an array where each element denotes the height of a block. One unit of height is equal to one unit of water, given there exists space between two elements to store it. Therefore, we need to find out all such pairs that can store water, taking care of the possible cases:
n = int(input())
arr = [int(i) for i in input().split()]
# We use two arrays, left[] and right[], which keep track of the tallest block
# seen so far in each direction of traversal.
left, right = [0] * n, [0] * n
left[0] = arr[0]          # left-most element
right[n-1] = arr[-1]      # right-most element
for i in range(1, n):
    left[i] = max(left[i-1], arr[i])
for i in range(n-2, -1, -1):
    right[i] = max(right[i+1], arr[i])
# Water above each block is bounded by the shorter of the two running maxima
water = sum(min(left[i], right[i]) - arr[i] for i in range(n))
print(water)
Simply put, eigenvectors are directional entities along which linear transformation features like compression, flip, etc. can be applied.
Eigenvalues are the magnitudes of the linear transformation along each of those eigenvector directions.
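A minimal NumPy sketch (the matrix is a simple scaling transformation):
import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)    # [2. 3.] - magnitude of the transformation along each eigenvector
print(eigenvectors)   # columns are the eigenvectors, the directions that are only scaled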
102. How would you define the number of clusters in a clustering algorithm?
Ans. The number of clusters can be determined by finding the silhouette score. Often we aim to
get some inferences from data using clustering techniques so that we can have a broader
picture of a number of classes being represented by the data. In this case, the silhouette score
helps us determine the number of cluster centres to cluster our data along.
103. What are the performance metrics that can be used to estimate the efficiency of a linear
regression model?
Ans. The performance metric that is used in this case is:
Splitting criteria
Min_leaves
Min_samples
Max_depth
109. How to deal with multicollinearity?
Ans. Multicollinearity can be dealt with by the following steps:
111. Is ARIMA model a good fit for every time series problem?
Ans. No, ARIMA model is not suitable for every type of time series problem. There are situations
where ARMA model and others also come in handy.
ARIMA is best when different standard temporal structures require to be captured for time series
data.
112. How do you deal with the class imbalance in a classification problem?
Ans. Class imbalance can be dealt with in the following ways:
115. How to deal with very few data samples? Is it possible to make a model out of it?
Ans. If very few data samples are there, we can make use of oversampling to produce new data
points. In this way, we can have new data points.
PCA takes into consideration the variance. LDA takes into account the distribution of classes.
Manhattan
Minkowski
Tanimoto
Jaccard
Mahalanobis
121. Which metrics can be used to measure correlation of categorical data?
Ans. The Chi-square test can be used for doing so. It gives a measure of the association between categorical predictors.
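A minimal SciPy sketch (the contingency table is made up):
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows = gender, columns = product preference
table = np.array([[30, 10],
                  [20, 40]])

chi2, p_value, dof, expected = chi2_contingency(table)
print(chi2, p_value)   # a small p-value suggests the two categorical variables are related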
122. Which algorithm can be used in value imputation in both categorical and continuous
categories of data?
Ans. KNN is one of the few algorithms that can be used for imputation of both categorical and continuous variables.
127. If we have a high bias error what does it mean? How to treat it?
Ans. A high bias error means that the model we are using is ignoring all the important trends in the data and is underfitting.
To reduce underfitting:
Increasing the number of epochs results in increasing the duration of training of the model. It’s
helpful in reducing the error.
128. Which type of sampling is better for a classification model and why?
Ans. Stratified sampling is better in the case of classification problems because it takes into account the balance of classes in the train and test sets: the proportion of classes is maintained, and hence the model performs better. In the case of random sampling, the data is divided into two parts without taking into consideration the balance of classes in the train and test sets, so some classes might be present only in the train set or only in the validation set. Hence the results of the resulting model are poorer in this case.
1 = not correlated.
Between 1 and 5 = moderately correlated.
Greater than 5 = highly correlated.
130. When can be a categorical value treated as a continuous variable and what effect does it
have when done so?
Ans. A categorical predictor can be treated as a continuous one when the nature of the data points it represents is ordinal. If the predictor variable has ordinal data, then it can be treated as continuous and its inclusion in the model can increase the performance of the model.
134. Which sampling technique is most suitable when working with time-series data?
Ans. We can use a custom iterative sampling such that we continuously add samples to the train set. We only need to keep in mind that the sample used for validation should be added to the next train set, and a new sample used for validation.
Reduces overfitting
Shortens the size of the tree
Reduces complexity of the model
Increases bias
136. What is normal distribution?
Ans. The distribution having the below properties is called normal distribution.
The mean, mode and median are all equal.
The curve is symmetric at the center (i.e. around the mean, μ).
Exactly half of the values are to the left of center and exactly half the values are to the right.
The total area under the curve is 1.
137. What is the 68 per cent rule in normal distribution?
Ans. The normal distribution is a bell-shaped curve. Most of the data points cluster around the mean, and since there is no skewness, approximately 68 per cent of the data lies within one standard deviation of the mean.
A chi-square test for independence compares two variables in a contingency table to see if they
are related.
A very small chi-square test statistic implies that the observed data fits the expected data extremely well.
141. Which kind of recommendation system is used by amazon to recommend similar items?
Ans. Amazon uses a collaborative filtering algorithm for the recommendation of similar items: an item-to-item similarity mapping based on users' likeness and propensity to buy.
Example – “Stress testing, a routine diagnostic tool used in detecting heart disease, results in a
significant number of false positives in women”
Example – “it’s possible to have a false negative—the test says you aren’t pregnant when you
are”
Naive Bayes:
Works well with small datasets compared to DT, which needs more data
Lesser overfitting
Smaller in size and faster in processing
Decision Trees:
Decision Trees are very flexible, easy to understand, and easy to debug
No preprocessing or transformation of features required
Prone to overfitting but you can use pruning or Random forests to avoid that.
149. What do you mean by the ROC curve?
Receiver operating characteristic (ROC) curve: the ROC curve illustrates the diagnostic ability of a binary classifier. It is created by plotting the True Positive Rate against the False Positive Rate at various threshold settings. The performance metric of the ROC curve is the AUC (area under the curve): the higher the area under the curve, the better the prediction power of the model.
The same calculation can be applied to a naive model that assumes absolutely no predictive power, and to a saturated model that assumes perfect predictions.
The likelihood values are used to compare different models, while the deviances (test, naive, and saturated) can be used to determine the predictive power and accuracy. A logistic regression model's accuracy will be at its most optimistic on the development data set, but that is not the case once the model is applied to another data set.
Three questions matter when evaluating a model: How well does the model fit the data? Which predictors are most important? Are the predictions accurate? The following metrics help answer them.
Akaike Information Criteria (AIC): In simple terms, AIC estimates the relative amount of
information lost by a given model. So the less information lost the higher the quality of the
model. Therefore, we always prefer models with minimum AIC.
Receiver operating characteristic (ROC) curve: the ROC curve illustrates the diagnostic ability of a binary classifier. It is created by plotting the True Positive Rate against the False Positive Rate at various threshold settings. The performance metric of the ROC curve is the AUC (area under the curve): the higher the area under the curve, the better the prediction power of the model.
Confusion Matrix: In order to find out how well the model does in predicting the target variable, we use a confusion matrix / classification rate. It is nothing but a tabular representation of actual vs. predicted values, which helps us to find the accuracy of the model.
153. What are the advantages of SVM algorithms?
SVM algorithms have advantages mainly in terms of complexity. First, I would like to make clear that both logistic regression and SVM can form non-linear decision surfaces and can be coupled with the kernel trick. If logistic regression can be coupled with a kernel, then why use SVM?
SVM is a linear separator. When data is not linearly separable, SVM needs a kernel to project the data into a space where it can separate it; therein lies both its greatest strength and its weakness. By being able to project data into a high-dimensional space, SVM can find a linear separation for almost any data, but at the same time it needs to use a kernel, and we can argue that there is not a perfect kernel for every dataset.
155. What is the difference between SVM Rank and SVR (Support Vector Regression)?
One is used for ranking and the other is used for regression.
There is a crucial difference between regression and ranking. In regression, the absolute value
is crucial. A real number is predicted.
In ranking, the only thing of concern is the ordering of a set of examples. We only want to know
which example has the highest rank, which one has the second-highest, and so on. From the
data, we only know that example 1 should be ranked higher than example 2, which in turn
should be ranked higher than example 3, and so on. We do not know by how much example 1 is
ranked higher than example 2, or whether this difference is bigger than the difference between
examples 2 and 3.
156. What is the difference between the normal soft margin SVM and SVM with a linear kernel?
Hard-margin
You have the basic SVM: hard margin. This assumes that the data is very well behaved and that you can find a perfect classifier, one with 0 error on the training data.
Soft-margin
Data is usually not well behaved, so a hard-margin SVM may not have a solution at all. We therefore allow a little bit of error on some points, so the training error will not be 0, but the average error over all points is minimized.
Kernels
The above assumes that the best classifier is a straight line. But what if it is not a straight line? (E.g. it is a circle: inside the circle is one class, outside is another class.) If we are able to map the data into higher dimensions, the higher dimension may give us a straight line.
157. How is linear classifier relevant to SVM?
An SVM is a type of linear classifier. If you don't mess with kernels, it's arguably the simplest type of linear classifier.
Linear classifiers learn linear functions from your data that map your input to scores like so: scores = Wx + b, where W is a matrix of learned weights, b is a learned bias vector that shifts your scores, and x is your input data. This type of function may look familiar to you if you remember y = mx + b from high school.
A typical SVM loss function (the function that tells you how good your calculated scores are in relation to the correct labels) is the hinge loss. It takes the form: Loss = sum over all classes except the correct one of max(0, score_j - score_correct + 1).
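A minimal NumPy sketch of that hinge loss for a single example (the scores are made up):
import numpy as np

def hinge_loss(scores, correct):
    # scores: raw class scores from scores = Wx + b for one example
    margins = np.maximum(0, scores - scores[correct] + 1)
    margins[correct] = 0          # the correct class does not contribute
    return margins.sum()

scores = np.array([3.2, 5.1, -1.7])   # hypothetical scores for three classes
print(hinge_loss(scores, correct=0))  # max(0, 5.1-3.2+1) + max(0, -1.7-3.2+1) = 2.9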
158. What are the advantages of using a naive Bayes for classification?
Very simple, easy to implement and fast.
If the NB conditional independence assumption holds, then it will converge quicker than
discriminative models like logistic regression.
Even if the NB assumption doesn’t hold, it works great in practice.
Needs less training data.
Highly scalable. It scales linearly with the number of predictors and data points.
Can be used for both binary and multi-class classification problems.
Can make probabilistic predictions.
Handles continuous and discrete data.
Not sensitive to irrelevant features.
159. Are Gaussian Naive Bayes the same as binomial Naive Bayes?
Binomial (Bernoulli) Naive Bayes: It assumes that all our features are binary, i.e. they take only two values: 0 can represent "word does not occur in the document" and 1 can represent "word occurs in the document".
Gaussian Naive Bayes: Because of the assumption of the normal distribution, Gaussian Naive Bayes is used when all our features are continuous. For example, in the Iris dataset the features are sepal width, petal width, sepal length, and petal length, which can take a range of different values in the data set. We can't represent these features in terms of their occurrences, which means the data is continuous; hence we use Gaussian Naive Bayes here.
160. What is the difference between the Naive Bayes Classifier and the Bayes classifier?
Naive Bayes assumes conditional independence, P(X|Y, Z) = P(X|Z), whereas more general Bayes Nets (sometimes called Bayesian Belief Networks) allow the user to specify which attributes are, in fact, conditionally independent.
For the Bayesian network as a classifier, the features are selected based on some scoring function, like the Bayesian scoring function or minimal description length (the two are equivalent in theory to each other, given that there is enough training data). The scoring functions mainly restrict the structure (connections and directions) and the parameters (likelihoods) using the data. After the structure has been learned, the class is determined only by the nodes in the Markov blanket (its parents, its children, and the parents of its children), and all variables outside the Markov blanket are discarded.
Discriminant Functions
Probabilistic Generative Models
Bayesian Theorem
Naive Assumptions of Independence and Equal Importance of feature vectors.
Moreover, it is a special type of Supervised Learning algorithm that could do simultaneous
multi-class predictions (as depicted by standing topics in many news apps).
Since these are generative models, so based upon the assumptions of the random variable
mapping of each feature vector these may even be classified as Gaussian Naive Bayes,
Multinomial Naive Bayes, Bernoulli Naive Bayes, etc.
Recall is also known as sensitivity and is the fraction of the total amount of relevant instances that were actually retrieved.
Both precision and recall are therefore based on an understanding and measure of relevance.
165. What Are the Three Stages of Building a Model in Machine Learning?
To build a model in machine learning, you need to follow a few steps:
● The classifier in SVM depends only on a subset of points. Since we need to maximize the distance between the closest points of the two classes (aka the margin), we need to care about only a subset of points, unlike logistic regression.