
Machine Learning Practice Q&A set

1. What is Machine Learning?


ML is the art of granting machines the ability to think. It is a field of computer
science that uses statistical techniques to give machines the ability to learn
without being explicitly programmed. Machine Learning deals with building
algorithms that can receive input data, perform statistical analysis to predict
output, and update the output as newer data become available.

2. What are some use cases of Machine Learning from our daily lives?

1. YouTube/Netflix/Amazon Prime: Recommend videos, movies, or series based on our interests.
2. Gmail: Classifies spam emails on your behalf.
3. Banks: Detect fraudulent/anomalous behaviour and hold the transaction; this also helps in fraud detection.
4. Voice Assistants: Alexa/Siri use Machine Learning to respond to the questions asked.

There are Machine Learning use cases in almost every industry today.

3. What are the different kinds of Machine Learning? What do they signify?

The four kinds of learning in Machine Learning are:

1. Supervised Learning: the model learns from labelled data (input-output pairs).
2. Unsupervised Learning: the model finds structure in unlabelled data.
3. Semi-supervised Learning: the model learns from a mix of a small amount of labelled data and a large amount of unlabelled data.
4. Reinforcement Learning: an agent learns by interacting with an environment and receiving rewards or penalties.


4. What are different steps in Machine Learning?


We perform the following steps in Machine Learning:

1. Collect data
2. Filter data
3. Analyse data
4. Train algorithms
5. Test algorithms
6. Use algorithms for future predictions

5. What is Bias and Variance? How can we have an optimum of both?

Bias is the difference between the average prediction of our model and the
correct value which we are trying to predict. A model with high bias pays very
little attention to the training data and oversimplifies the model. It always
leads to high error on training and test data.

Variance is the variability of model prediction for a given data point, or a value
which tells us the spread of our data. A model with high variance pays a lot of
attention to the training data and does not generalize on data which it hasn't
seen before. As a result, such models perform very well on training data but
have high error rates on test data.

In supervised learning, underfitting happens when a model is unable to capture
the underlying pattern of the data. These models usually have high bias and
low variance. It happens when we have too little data to build an accurate
model, or when we try to build a linear model with nonlinear data. Also, these
kinds of models, like linear and logistic regression, are too simple to capture
the complex patterns in the data.

In supervised learning, overfitting happens when our model captures the noise
along with the underlying pattern in the data. It happens when we train our
model a lot over a noisy dataset. These models have low bias and high variance.
These models are very complex, like decision trees, which are prone to overfitting.

Bias-Variance Trade-off:

For any supervised algorithm, having a high bias error usually means it has a low
variance error, and vice versa. To be more specific, parametric or linear ML
algorithms often have high bias but low variance, whereas non-parametric or
non-linear algorithms tend to have low bias but high variance.

The goal of any ML model is to obtain a low variance and a low bias state,
which is often difficult due to the parametrization of machine learning
algorithms. Common ways to achieve an optimum bias and variance are:

a. Minimizing the total error
b. Using bagging and resampling techniques
c. Tuning the algorithm's hyperparameters (adjusting minor values in the algorithm)


6. What is Linear Regression?


Linear regression is a linear approach to modelling the relationship between a
scalar response (or dependent variable) and one or more explanatory variables
or independent variables.

The case of one explanatory variable is called simple linear regression; for
more than one explanatory variable, the process is called multiple linear
regression.

A linear regression line has an equation of the form Y = b0 + b1*X, where b0 is
the intercept and b1 is the slope of the line.

7. What are the assumptions of Linear Regression?


Short Trick: Assumptions can be abbreviated as LINE in order to remember.

L : Linearity ( Relationship between x and y is linear)


I : Independence (Observations are independent of each other)
N : Normality (for any fixed value of x, y is normally distributed)
E : Equal Variance (homoscedasticity)

Long Answer:
There are three main assumptions in a linear regression model:

1. The assumption about the relationship between the dependent and
independent variables: there should be a linear relationship between the
dependent and independent variables. This is known as the 'linearity
assumption'.

2. Assumptions about the residuals:

 Normality assumption: It is assumed that the error terms, ε(i), are
normally distributed.
(Explanation: If the residuals are not normally distributed, their
randomness is lost, which implies that the model is not able to
explain the relation in the data.)
 Zero mean assumption: It is assumed that the residuals have a
mean value of zero.
(Explanation: The zero conditional mean assumption says that there are
both negative and positive errors that cancel out on average. This
helps us to estimate the dependent variable precisely.)
 Constant variance assumption: It is assumed that the residual
terms have the same (but unknown) variance, σ². This assumption
is also known as the assumption of homogeneity or
homoscedasticity.
(Explanation: If the variance is not constant, the observations with larger
errors will have more pull or influence on the fitted model.)
 Independent error assumption: It is assumed that the residual
terms are independent of each other, i.e. their pair-wise
covariance is zero.

3. Assumptions about the estimators:


 The independent variables are measured without error.
 The independent variables are linearly independent of each other,
i.e. there is no multicollinearity in the data.

(Explanation: If the independent variables are not linearly independent
of each other, the uniqueness of the least squares solution (or normal
equation solution) is lost.)

8. What is Regularization? Explain different types of
Regularizations.

Regularization is a technique that is used to solve the overfitting problem of
machine learning models.

The types of Regularization are as follows:

 The L1 regularization (also called Lasso)
 The L2 regularization (also called Ridge)
 The L1/L2 regularization (also called Elastic net)

L1 Regularization: L1 Regularization or Lasso Regularization adds a penalty to
the error function. The penalty is the sum of the absolute values of the weights:

Cost = Loss + λ * ( |w1| + |w2| + ... + |wn| )

Here λ is the tuning parameter, called the regularization parameter, which
decides how much we want to penalize the model.

L2 Regularization: L2 Regularization or Ridge Regularization also adds a
penalty to the error function, but the penalty here is the sum of
the squared values of the weights:

Cost = Loss + λ * ( w1² + w2² + ... + wn² )

Here again λ is the tuning parameter, called the regularization parameter,
which decides how much we want to penalize the model.


Elastic-net Regularization: Elastic-net is a mix of both L1 and L2
regularizations. A penalty is applied to both the sum of the absolute values and
the sum of the squared values of the weights:

Cost = Loss + λ * ( α * Σ|wi| + (1 - α) * Σwi² )

Lambda (λ) is a shared penalization parameter while alpha (α) sets the ratio
between L1 and L2 regularization in the Elastic Net Regularization. Hence, we
expect a hybrid behaviour between L1 and L2 regularization.
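As an illustration (not from the original text), here is a minimal scikit-learn sketch of the three regularized regressions; the data is a small synthetic example, and alpha plays the role of λ (with l1_ratio playing the role of α for Elastic Net):

import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

# Synthetic data assumed purely for illustration
rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = X @ np.array([1.5, 0.0, 2.0, 0.0, -1.0]) + 0.5 * rng.randn(100)

ridge = Ridge(alpha=1.0).fit(X, y)                      # L2 penalty
lasso = Lasso(alpha=0.1).fit(X, y)                      # L1 penalty (drives some weights to exactly 0)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)    # mix of L1 and L2

print("Ridge coefficients:      ", ridge.coef_)
print("Lasso coefficients:      ", lasso.coef_)
print("Elastic Net coefficients:", enet.coef_)

Notice that Lasso tends to shrink the coefficients of the uninformative features all the way to zero, while Ridge only shrinks them towards zero.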

9. How to choose the value of the regularisation parameter (λ)?

Selecting the regularisation parameter is a tricky business. If the value of λ is
too high, it will lead to extremely small values of the regression
coefficient β, which will lead to the model underfitting (high bias – low
variance).

On the other hand, if the value of λ is 0 (very small), the model will tend to
overfit the training data (low bias – high variance).

There is no proper way to select the value of λ. What you can do is have a sub-
sample of data and run the algorithm multiple times on different sets. Here,
the person has to decide how much variance can be tolerated. Once the user is
satisfied with the variance, that value of λ can be chosen for the full dataset.
One thing to be noted is that the value of λ selected here was optimal for that
subset, not for the entire training data.

10. Explain gradient descent?


Definition: Gradient descent is an optimization algorithm used to find the
values of parameters (coefficients) of a function (f) that minimizes a cost
function (cost).

(Note: Cost function is the average of the loss functions for all the training
examples.)


When it is used: Gradient descent is best used when the parameters cannot be
calculated analytically (e.g. using linear algebra) and must be searched for by
an optimization algorithm.
Details: The goal of any Machine Learning model is to minimize the cost function.
To get the minimum of the cost function, we use the Gradient Descent algorithm.
Without going too deep into the mathematics, let's see the steps involved in
Gradient Descent.

Step 1: Initialize the coefficient of the function. The initial value should either
be 0 or a very small value.
coefficient = 0.0

Step 2: For the coefficient, we have chosen in step 1, we will calculate the cost
by putting it in function.
cost = f(coefficient)

Step 3: Find the derivative(delta) of the cost or the derivative of the function

delta = derivative(cost)

Step 4: Now that we know from the derivative which direction is downhill, we
can now update the coefficient values. A Learning Rate (alpha) must be
specified that controls how much the coefficients can change on each update.

coefficient = coefficient – (alpha * delta)

This process is repeated until the cost of the coefficients (cost) is 0.0 or close
enough to zero to be good enough, or the number of epochs is reached.
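The following minimal Python sketch (not part of the original text) implements the steps above for an assumed toy cost function f(w) = (w - 3)², whose minimum is at w = 3:

# Gradient descent on a simple one-parameter cost function
def cost(w):
    return (w - 3) ** 2          # Step 2: cost for the current coefficient

def derivative(w):
    return 2 * (w - 3)           # Step 3: derivative (delta) of the cost

coefficient = 0.0                # Step 1: initialize the coefficient
alpha = 0.1                      # learning rate
for epoch in range(100):
    delta = derivative(coefficient)
    coefficient = coefficient - alpha * delta   # Step 4: update the coefficient
    if cost(coefficient) < 1e-10:               # stop when the cost is close enough to zero
        break

print(coefficient)               # approximately 3.0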

11. What is Learning Rate? How to choose the value of the parameter learning rate (α)?

Learning rate is a hyper-parameter that controls how much we are adjusting
the weights of our network with respect to the loss gradient. The lower the value,
the slower we travel along the downward slope.

Selecting the value of learning rate is a tricky business. If the value is too small,
the gradient descent algorithm takes ages to converge to the optimal solution.
On the other hand, if the value of the learning rate is high, the gradient
descent will overshoot the optimal solution and most likely never converge to
the optimal solution.

To overcome this problem, you can try different values of alpha over a range of
values and plot the cost vs the number of iterations. Then, based on the
graphs, the value corresponding to the graph showing the rapid decrease can
be chosen.

If you see that the cost is increasing with the number of iterations, your
learning rate parameter is high and it needs to be decreased.

12. How to carry out hypothesis testing in linear regression?

Hypothesis testing can be carried out in linear regression for the following
purposes:

1. To check whether a predictor is significant for the prediction of the target
variable. Two common methods for this are:
 Using p-values:
If the p-value of a variable is greater than a certain limit (usually 0.05),
the variable is insignificant in the prediction of the target variable.
 By checking the values of the regression coefficient:
If the value of regression coefficient corresponding to a predictor is zero,
that variable is insignificant in the prediction of the target variable and
has no linear relationship with it.
2. To check whether the calculated regression coefficients are good
estimators of the actual coefficients.

13. What is Variance Inflation Factor (VIF)? What is its significance?

Variance Inflation Factor (VIF) is used to detect the presence of multicollinearity.


Variance inflation factors (VIF) measure how much the variance of the estimated
regression coefficients is inflated as compared to when the predictor variables
are not linearly related.

It is obtained by regressing each independent variable, say X, on the remaining
independent variables (say Y and Z) and checking how much of it (of X) is
explained by these variables.

The higher the VIF, the higher the R-squared, which means the variable X is
collinear with the Y and Z variables. If all the variables are completely
orthogonal, the R-squared will be 0, resulting in a VIF of 1.

Rule of thumb is to avoid any variable with VIF > 5.
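A minimal sketch of how VIF is typically computed with statsmodels (the DataFrame and column names are assumptions for illustration):

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

def vif_table(predictors):
    # predictors: a DataFrame containing only the independent variables
    X = add_constant(predictors)            # VIF is usually computed with an intercept
    vifs = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    return pd.Series(vifs, index=X.columns).drop("const")

# Example usage (hypothetical column names): vif_table(df[["X", "Y", "Z"]])
# Values above 5 flag multicollinearity.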

14. How do Residual Plots and Q-Q plots help in a linear regression model?

Residual plots and Q-Q plots are used to visually check that your data meets
the homoscedasticity and normality assumptions of linear regression.

A residual plot lets you see if your data appears homoscedastic.


Homoscedasticity means that the residuals, the difference between the
observed value and the predicted value, are equal across all values of your
predictor variable. If your data are homoscedastic then you will see the points
randomly scattered around the x axis. If they are not (e.g. if they form a curve,
bowtie, fan etc.) then it suggests that your data doesn't meet the assumption.

Q-Q plots let you check that the data meet the assumption of normality. They
compare the distribution of your data to a normal distribution by plotting the
quantiles of your data against the quantiles of a normal distribution. If your
data are normally distributed, then the points should form an approximately
straight line.

15. What is the difference between R-Squared and Adjusted R-Squared?

Short Answer: In case of adjusted R2 we include only the important and useful
variables, whereas in R2 all the variables are included.

Long Answer: R-squared or R2 explains the degree to which your input
variables explain the variation of your output / predicted variable. So, if R-
square is 0.8, it means 80% of the variation in the output variable is explained
by the input variables. So, in simple terms, higher the R squared, the more
variation is explained by your input variables and hence better is your model.

However, the problem with R-squared is that it will either stay the same or
increase with addition of more variables, even if they do not have any
relationship with the output variables. This is where “Adjusted R square”
comes to help. Adjusted R-square penalizes you for adding variables which do
not improve your existing model. Adjusted R2 also explains the degree to
which your input variables explain the variation of your output / predicted
variable but adjusts for the number of terms in a model. If you add more and
more useless variables to a model, adjusted r-squared will decrease. If you add
more useful variables, adjusted r-squared will increase.
Adjusted R2 will always be less than or equal to R2. So, basically in adjusted R2
we include only the important and useful variables, whereas in R2 all the
variables are included.

Hence, if you are building linear regression on multiple variables, it is always
suggested that you use Adjusted R-squared to judge the goodness of the model.
In case you only have one input variable, R-squared and Adjusted R-squared
would be exactly the same.

Typically, the more non-significant variables you add to the model, the more the
gap between R-squared and Adjusted R-squared increases.
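A small sketch of the adjusted R-squared formula, assuming n observations and p predictors (the formula is standard but was not shown in this copy):

# Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)
def adjusted_r2(r2, n, p):
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(adjusted_r2(0.80, 100, 5))   # about 0.789 for R2 = 0.80, 100 rows, 5 predictors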

16. What are forecast KPIs or metrics? Explain Bias, RMSE, MAE, MAPE.

Forecast KPIs are used to evaluate forecast accuracy. Several common KPIs are
as follows:

Bias: Bias represents the historical average error, Bias = (1/n) * Σ e(t). Basically,
will your forecasts be on average too high (i.e. you overshot the demand) or too
low (i.e. you undershot the demand)? This will give you the overall direction of
the error.


Mean Absolute Percentage Error (MAPE) is the sum of the individual absolute
errors divided by the demand (each period separately), MAPE = (1/n) * Σ |e(t)| / d(t),
where d(t) is the demand in period t. In other words, it is the average of the
percentage errors.

Issue with MAPE: MAPE divides each error individually by the demand, so it is
skewed: high errors during low-demand periods will have a major impact on
MAPE. Due to this, optimizing MAPE will result in a strange forecast that will
most likely undershoot the demand. Just avoid it.

The Mean Absolute Error (MAE) is a very good KPI to measure forecast
accuracy. As the name implies, it is the mean of the absolute error,
MAE = (1/n) * Σ |e(t)|.

Issue with MAE: One of the first issues with this KPI is that it is not scaled to the
average demand. If one tells you that MAE is 10 for a particular item, you
cannot know if this is good or bad. If your average demand is 1000, it is of
course excellent, but if the average demand is 1, this is very poor accuracy. To
solve this, it is common to divide MAE by the average demand to get a
percentage (MAE%).

RMSE: It is defined as the square root of the average squared error,
RMSE = sqrt( (1/n) * Σ e(t)² ).


Issue with RMSE: Just as for MAE, RMSE is not scaled to the demand. We can
then define RMSE% = RMSE / (average demand).

(In all the above equations, e(t) is the error term for period t, i.e. the difference
between the forecast and the actual demand.)
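A short Python sketch (for illustration only; the array values are made up) that computes these KPIs from a demand series and a forecast series:

import numpy as np

def forecast_kpis(demand, forecast):
    demand = np.asarray(demand, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    error = forecast - demand                      # e(t)
    bias = error.mean()                            # direction of the error
    mape = np.mean(np.abs(error) / demand)         # skewed by low-demand periods
    mae = np.abs(error).mean()
    rmse = np.sqrt(np.mean(error ** 2))
    scale = demand.mean()
    return {"Bias": bias, "MAPE": mape, "MAE%": mae / scale, "RMSE%": rmse / scale}

print(forecast_kpis([100, 80, 120], [110, 70, 125]))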

17. What is Multicollinearity and what to do with it? How to remove multicollinearity?

In regression, "multicollinearity" refers to predictors that are correlated with
other predictors. Multicollinearity occurs when your model includes multiple
factors that are correlated not just to your response variable, but also to each
other. In other words, it results when you have factors that are a bit
redundant.

If multicollinearity is a problem in your model -- if the VIF for a factor is near or
above 5 -- the solution may be relatively simple. Try one of these:

 Remove highly correlated predictors from the model. If you have two
or more factors with a high VIF, remove one from the model. Because
they supply redundant information, removing one of the correlated
factors usually doesn't drastically reduce the R-squared. Consider
using stepwise regression, best subsets regression, or specialized
knowledge of the data set to remove these variables. Select the model
that has the highest R-squared value.

 Use Principal Components Analysis or regression methods that cut the
number of predictors to a smaller set of uncorrelated components.

Linear Regression Code Snippet.
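The original snippet is not reproduced in this copy; the following is a minimal hedged sketch of a typical linear regression workflow in scikit-learn (the file name "data.csv" and target column "y" are assumptions for illustration):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

df = pd.read_csv("data.csv")                     # hypothetical dataset
X, y = df.drop(columns=["y"]), df["y"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print("Coefficients:", model.coef_, "Intercept:", model.intercept_)
print("R-squared:", r2_score(y_test, y_pred))
print("MSE:", mean_squared_error(y_test, y_pred))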


Logistic Regression

18. What is Logistic Regression? What is the Logistic Function or Sigmoid Function, and what is the range of values of this function?


Logistic regression is a supervised learning classification algorithm used to
predict the probability of a target variable. The target or dependent variable
can only have 2 possible classes (e.g. Pass/Fail, 0/1, Spam/No-Spam etc.).

Logistic Function/Sigmoid Function: The Logistic Function is also called the
sigmoid function. It is an S-shaped curve that can take any real-valued number
and map it into a value between 0 and 1, but never exactly at those limits:

1 / (1 + e^-value)

Here 'e' is Euler's number, the base of the natural logarithm. We write the
equation for logistic regression as follows:

y = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))

In the above equation, b0 and b1 are the two coefficients of the input x. We
estimate these two coefficients using "maximum likelihood estimation". The
maximum log-likelihood cost function for logistic regression, with h(x) denoting
the sigmoid output, is:

J(b) = - Σ [ y * log(h(x)) + (1 - y) * log(1 - h(x)) ]

To get a better grasp on this cost function, let's look at the cost that we
calculate for one single-sample instance:

cost = - y * log(h(x)) - (1 - y) * log(1 - h(x))


Looking at the preceding equation, we can see that the first term becomes
zero if y = 0, and the second term becomes zero if y = 1.

19. Why does logistic regression, despite being a classification algorithm, have "Regression" in its name?

Although the dependent variable in logistic regression is binary, the predicted
log odds can be any real number between negative infinity and positive
infinity. In other words, your predictions are continuous. In order to use
logistic regression for classification, you need to convert the log odds to a
probability (which is also continuous from 0 to 1) and then set a probability
threshold. The choice of threshold is done outside the model.

20. What is Maximum Likelihood?

It is a method in statistics for estimating parameter(s) of a model for given
data. The basic intuition behind MLE is that the estimate which explains the
data best, will be the best estimator.

The main advantage of MLE is that it has asymptotic property. It means that
when the size of the data increases, the estimate converges faster towards the
population parameter. We use MLE for many techniques in statistics to
estimate parameters. I have explained the general steps we follow to find an
estimate for the parameter.

Step 1: Make an assumption about the data generating function.

Step 2: Formulate the likelihood function for the data, using the data
generating function.

The likelihood function is nothing but the probability of observing this data
given the parameters (P(D|θ)). The parameters depend on our assumptions
and the data-generating function.

Step 3: Find an estimator for the parameter using an optimization technique.

This is done by finding the estimate that maximizes the likelihood function.
This is the reason why we name the estimator calculated using MLE as M-
estimator.

Example: We have tossed a coin n times and observed k heads. Note that we
consider a head a success and a tail a failure.

Step 1: (Assumption): The coin follows the Bernoulli distribution function.

Step 2: (Likelihood Function): The likelihood function is the binomial distribution
function P(D|θ) in this case. We need to find the best estimate for p
(the probability of getting a head) given that k of n tosses are heads.

Step 3: (Estimation): The M-estimator is:

p̂ = k / n

21. What is odds ratio?

If the probability of something happening is p, the odds are given by p / (1 - p).

Example:
Suppose in a throw of a fair dice, A is the event that either 1 or 2 will surface.

So, p(A) = p = 1/3 and hence odds of A = (1/3) / (2/3) = 1/2

The odds of A are 1/2 (50%) in this case, which can be interpreted as: the
likelihood of the event A is 50% of the likelihood of the complementary event of A.

22. Can we use Linear Regression in place of Logistic Regression for classification?


No. The reasons why linear regressions cannot be used in case of binary
classification are as follows:
Distribution of error terms: The distribution of data in case of linear and
logistic regression is different. Linear regression assumes that error terms are
normally distributed. In case of binary classification, this assumption does not
hold true.
Model output: In linear regression, the output is continuous. In case of binary
classification, an output of a continuous value does not make sense. For binary
classification problems, linear regression may predict values that can go
beyond 0 and 1. If we want the output in the form of probabilities, which can
be mapped to two different classes, then its range should be restricted to 0
and 1. As the logistic regression model can output probabilities with
logistic/sigmoid function, it is preferred over linear regression.
Variance of residual errors: Linear regression assumes that the variance of
random errors is constant. This assumption is also violated when the target is
binary, as in classification problems.

23. What are the assumptions of Logistic Regression?

1. Logistic regression assumes that there is minimal or no multicollinearity
among the independent variables.

2. Logistic regression assumes that the independent variables are linearly
related to the log of odds.

3. Logistic regression usually requires a large sample size to predict properly.

4. Logistic regression with two classes assumes that the dependent variable is
binary, and ordered logistic regression requires the dependent variable to be
ordered.

5. Logistic regression assumes the observations to be independent of each
other.

24. What is the decision boundary in case of Logistic Regression?

The decision boundary helps to differentiate probabilities into a positive class
and a negative class.

For Logistic Regression, we only have a linear decision boundary. The boundary
or line that separates the two classes is chosen by setting a threshold
probability.

Step 1: The sigmoid gives the output in the range 0 to 1.

Step 2: Set a threshold probability (assume 0.5).

Step 3: Classification happens by separating the points with probability less
than or greater than 0.5.

(In the above example, the threshold of 0.5 defines the linear decision boundary.)

25. Which cost function is used in Logistic Regression? Why can we not use MSE (Mean Squared Error) as a cost function in Logistic Regression?

The cost function used in Logistic Regression is the log loss (cross-entropy)
cost function, applied to the output of the sigmoid/logistic function.
The main reason not to use MSE as the cost function for logistic regression is
that you do not want your cost function to be non-convex: combined with the
sigmoid, MSE becomes non-convex, and if the cost function is not convex it is
difficult for the optimization to converge to the global minimum.

26. Can logistic regression handle categorical variables directly? If not, then how can categorical values be used in Logistic Regression?

The inputs to a logistic regression model need to be numeric. The algorithm
cannot handle categorical variables directly. So, they need to be converted into
a format that is suitable for the algorithm to process.

The various levels of a categorical variable will be assigned a unique numeric
value known as the dummy variable. These dummy variables are handled by
the logistic regression model as any other numeric value.

27. How does Logistic Regression deal with Multiclass Classification? What is the one-vs-all method?

The most famous method of dealing with multiclass classification using logistic
regression is using the one-vs-all approach. Under this approach, a number of
models are trained, which is equal to the number of classes. The models work
in a specific way.

For example, the first model classifies the datapoint depending on whether it
belongs to class 1 or some other class; the second model classifies the
datapoint into class 2 or some other class. This way, each data point can be
checked over all the classes.

28. What are the evaluation metrics for Classification Algorithms?

The different ways are as follows:


 Confusion matrix
 Accuracy
 Precision
 Recall
 Specificity
 F1 score
 Precision-Recall or PR curve
 ROC (Receiver Operating Characteristics) curve
 PR vs ROC curve

Confusion Matrix: It is a performance measurement for machine learning
classification problems where the output can be two or more classes. It is a
table with the 4 different combinations of predicted and actual values.


Definition of the terms:

Positive (P): Observation is positive (for example: is an apple).
Negative (N): Observation is not positive (for example: is not an apple).
True Positive (TP) : Observation is positive, and is predicted to be positive.
False Negative (FN) : Observation is positive, but is predicted negative.
True Negative (TN) : Observation is negative, and is predicted to be negative.
False Positive (FP) : Observation is negative, but is predicted positive.

Classification Rate/Accuracy: Classification Rate or Accuracy is given by the
relation:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

However, there are problems with accuracy. It assumes equal costs for both
kinds of errors. A 99% accuracy can be excellent, good, mediocre, poor or
terrible depending upon the problem.

Recall/Sensitivity/True Positive Rate: Recall can be defined as the ratio of the
total number of correctly classified positive examples divided by the total
number of actual positive examples, Recall = TP / (TP + FN). High recall
indicates the class is correctly recognized (a small number of FN).

Precision: To get the value of precision we divide the total number of correctly
classified positive examples by the total number of predicted positive
examples, Precision = TP / (TP + FP). High precision indicates an example
labelled as positive is indeed positive (a small number of FP).


F-measure/F-stats/F1 Score: Since we have two measures (Precision and
Recall), it helps to have a measurement that represents both of them. We
calculate the F-measure, which uses the Harmonic Mean in place of the
Arithmetic Mean as it punishes extreme values more:

F1 = 2 * (Precision * Recall) / (Precision + Recall)

The F-measure will always be nearer to the smaller of Precision and Recall.

Specificity: The percentage of correctly identified negative instances out of the
total actual negative instances, Specificity = TN / (TN + FP). The denominator
(TN + FP) here is the actual number of negative instances present in the dataset.
It is similar to recall, but the focus shifts to the negative instances, like finding
out how many healthy patients were not having cancer and were told they don't
have cancer. It is a kind of measure of how well the classes are separated.
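A minimal scikit-learn sketch (the labels are a made-up toy example) that computes all of the above from a confusion matrix:

from sklearn.metrics import (confusion_matrix, accuracy_score, precision_score,
                             recall_score, f1_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]          # actual labels (toy example)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]          # model predictions (toy example)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, FP, FN, TN:", tp, fp, fn, tn)
print("Accuracy   :", accuracy_score(y_true, y_pred))    # (TP + TN) / total
print("Precision  :", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("Recall     :", recall_score(y_true, y_pred))      # TP / (TP + FN)
print("F1 score   :", f1_score(y_true, y_pred))
print("Specificity:", tn / (tn + fp))                     # TN / (TN + FP)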

PR Curve: It is the curve between precision and recall for various threshold
values. If we plot the precision-recall curves of several predictors for various
threshold values, the top right part of the graph is the ideal space where we get
high precision and recall. Based on our application we can choose the predictor
and the threshold value. PR AUC is just the area under the curve; the higher its
numerical value, the better.


ROC Curve: ROC stands for Receiver Operating Characteristic, and the curve is
plotted as TPR against FPR for various threshold values. As TPR increases, FPR
also increases. We want the threshold value that leads us closer to the top left
corner of the plot. Comparing different predictors on a given dataset also
becomes easy, and one can choose the threshold according to the application
at hand. ROC AUC is just the area under the curve; the higher its numerical
value, the better.

PR vs ROC Curve:
Both the metrics are widely used to judge a model’s performance.

Which one to use PR or ROC?

The answer lies in TRUE NEGATIVES.

Due to the absence of TN in the precision-recall equations, PR curves are useful
for imbalanced classes, i.e. when there is a majority negative class. The metric
does not take into consideration the high number of TRUE NEGATIVES of the
majority negative class, which gives it better resistance to the imbalance. This is
important when the detection of the positive class is very important.

For example, to detect cancer patients, where there is a high class imbalance
because very few of all the diagnosed actually have it, we certainly don't want to
miss a person having cancer and going undetected (recall), and we want to be
sure the detected one is having it (precision).

Due to the consideration of TN or the negative class in the ROC equation, ROC is
useful when both the classes are important to us, like the detection of cats and
dogs. The importance of true negatives makes sure that both the classes are
given importance, like the output of a CNN model in determining whether an
image is of a cat or a dog.

29. Logistic Regression Code Snippet:
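The original snippet is not reproduced in this copy; the following is a minimal hedged sketch of a typical logistic regression workflow in scikit-learn (the file name "data.csv" and target column "target" are assumptions for illustration):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

df = pd.read_csv("data.csv")                      # hypothetical dataset
X, y = df.drop(columns=["target"]), df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

clf = LogisticRegression()
clf.fit(X_train, y_train)

print(classification_report(y_test, clf.predict(X_test)))
print(clf.predict_proba(X_test)[:5])              # class probabilities from the sigmoid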


30. What is Decision Tree Algorithm?

A decision tree is a supervised machine learning algorithm mainly used for
regression and classification. It breaks down a dataset into smaller and
smaller subsets while at the same time an associated decision tree is
incrementally developed. The final result is a tree with decision nodes and leaf
nodes. A decision tree can handle both categorical and numerical data.

31. What are the different algorithms used in Decision Tree? Discuss.

Algorithms used in Decision Trees:

1. CART (Classification and Regression Trees): Uses Gini impurity as the splitting criterion.
2. C4.5 (successor of ID3): Uses Entropy (information gain) as the splitting criterion.
3. ID3 (Iterative Dichotomiser 3): Uses Entropy as the splitting criterion.
4. CHAID (Chi-square Automatic Interaction Detection): Uses Chi-Square as the
splitting criterion.

32. What is Pruning? What are the different types of Pruning?

Pruning is a technique used in Decision Tree that reduces the size of decision
trees by removing sections of the tree that provide little power to classify
instances. Pruning reduces the complexity of the final classifier, and hence
improves predictive accuracy by the reduction of overfitting.

There are generally two methods for pruning trees: pre-pruning and post-
pruning.

Pre-pruning is going to involve techniques that perform early stopping (e.g. we
stop the building of our tree before it is fully grown).

Post-pruning will involve fully growing the tree in its entirety, and then
trimming the nodes of the tree in a bottom-up fashion.

33. What is bagging?

Bagging, also called Bootstrap aggregating, is a machine learning ensemble
algorithm designed to improve the stability and accuracy of machine learning
algorithms used in statistical classification and regression. It also reduces
variance and helps to avoid overfitting. Although it is usually applied to
decision tree methods, it can be used with any type of method. Bagging is a
special case of the model averaging approach.

Let’s assume we have a sample dataset of 1000 instances (x) and we are using
the CART algorithm. Bagging of the CART algorithm would work as follows.

 Create many (e.g. 100) random sub-samples of our dataset with
replacement.
 Train a CART model on each sample.
 Given a new dataset, calculate the average prediction from each model.
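A minimal scikit-learn sketch of this idea (X_train, y_train and X_test are assumed to come from an earlier train/test split):

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# 100 CART models, each trained on a bootstrap sample drawn with replacement
bagging = BaggingClassifier(
    DecisionTreeClassifier(),   # base learner (CART)
    n_estimators=100,
    bootstrap=True,
    random_state=42,
)
bagging.fit(X_train, y_train)
y_pred = bagging.predict(X_test)   # majority-vote / averaged prediction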

34. What are the advantages and disadvantages of using Decision Trees?

Advantages
1. Easy to understand: Decision tree output is very easy to understand
even for people from a non-analytical background. It does not require any
statistical knowledge to read and interpret. Its graphical
representation is very intuitive and users can easily relate it to their
hypothesis.
2. Useful in data exploration: A decision tree is one of the fastest ways to
identify the most significant variables and the relation between two or more
variables. With the help of decision trees, we can create new variables /
features that have better power to predict the target variable. You can refer
to the article "Trick to enhance power of regression model" for one such trick.
It can also be used in the data exploration stage. For example, if we are
working on a problem where we have information available in hundreds
of variables, a decision tree will help to identify the most significant ones.
3. Less data cleaning required: It requires less data cleaning compared to
some other modelling techniques. To a fair degree, it is not influenced by
outliers and missing values.
4. Data type is not a constraint: It can handle both numerical and
categorical variables.
5. Non-parametric method: A decision tree is considered to be a non-
parametric method. This means that decision trees have no assumptions
about the space distribution and the classifier structure.

Disadvantages

1. Overfitting: Overfitting is one of the most practical difficulties for
decision tree models. This problem gets solved by setting constraints on
the model parameters and by pruning (discussed above).
2. Not fit for continuous variables: While working with continuous
numerical variables, a decision tree loses information when it categorizes
variables into different categories.
3. Decision trees do not work well if you have smooth boundaries, i.e. they
work best when the target is a discontinuous, piece-wise constant function.
If you truly have a linear target function, decision trees are not the best.

4. Decision trees do not work best if you have a lot of uncorrelated
variables. Decision trees work by finding the interactions between
variables. If you have a situation where there are no interactions
between variables, linear approaches might be the best.

5. Data fragmentation: Each split in a tree leads to a reduced dataset under
consideration, and hence the model created at the split will potentially
introduce bias.

35. What are the different splitting criteria used in Decision Tree? Discuss them in detail.

The commonly used splitting criteria are:
1. Gini impurity (used by CART): measures how often a randomly chosen sample
would be misclassified if it were labelled according to the class distribution in
the node; a split is chosen to minimize the weighted Gini impurity of the child nodes.
2. Entropy / Information Gain (used by ID3 and C4.5): a split is chosen to maximize
the reduction in entropy, i.e. the information gain.
3. Chi-Square (used by CHAID): measures the statistical significance of the
difference between the parent node and the child nodes.
4. Variance reduction (for regression trees): a split is chosen to minimize the
variance of the target within the child nodes.

36. What is Ensemble Modelling? Why is it used?


Ensemble modelling is a process where multiple diverse models are created to
predict an outcome, either by using many different modelling algorithms or by
using different training data sets. The ensemble model then aggregates the
predictions of each base model and results in one final prediction for the unseen
data. The motivation for using ensemble models is to reduce the generalization
error of the prediction.
How to avoid over-fitting in a Decision Tree? Discuss the parameters/hyperparameters in detail.

37. Decision Tree Code Snippet:
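The original snippet is not reproduced in this copy; the following is a minimal hedged sketch using scikit-learn (the iris dataset is used purely for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# max_depth and min_samples_leaf act as pre-pruning constraints to limit overfitting
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, min_samples_leaf=5,
                              random_state=42)
tree.fit(X_train, y_train)

print("Test accuracy:", accuracy_score(y_test, tree.predict(X_test)))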


38. What is Random Forest? Why is it preferred over
Decision Trees?

Random forest is a supervised learning algorithm which is used for both
classification as well as regression. However, it is mainly used for
classification problems. As we know, a forest is made up of trees, and more
trees means more robust forest. Similarly, random forest algorithm creates
decision trees on data samples and then gets the prediction from each of them
and finally selects the best solution by means of voting. It is an ensemble
method which is better than a single decision tree because it reduces the over-
fitting by averaging the result.

39. What are the advantages and disadvantages of Random Forest?

Advantage

The following are the advantages of Random Forest algorithm −


 It overcomes the problem of overfitting by averaging or combining the
results of different decision trees.
 Random forests work well for a larger range of data than a single
decision tree does.
 A random forest has less variance than a single decision tree.
 Random forests are very flexible and possess very high accuracy.
 The random forest algorithm does not require scaling of data. It maintains
good accuracy even when data is provided without scaling.
 The Random Forest algorithm maintains good accuracy even when a large
proportion of the data is missing.

Disadvantage

The following are the disadvantages of Random Forest algorithm −


 Complexity is the main disadvantage of Random forest algorithms.

 Construction of random forests is much harder and more time-consuming
than that of decision trees.
 More computational resources are required to implement the Random
Forest algorithm.
 It is less intuitive when we have a large collection of decision trees.
 The prediction process using random forests is very time-consuming in
comparison with other algorithms.

40. What are the different steps involved in Random Forest Modelling?

A random forest can be considered as an ensemble of decision trees. The idea
behind ensemble learning is to combine weak learners to build a more robust
model, a strong learner, that has a better generalization error and is less
susceptible to overfitting.

The random forest algorithm can be summarized in four simple steps:


1. Draw a random bootstrap sample of size n (randomly choose n samples from
the training set with replacement).
2. Grow a decision tree from the bootstrap sample. At each node:
 Randomly select d features without replacement.
 Split the node using the feature that provides the best split
according to the objective function, for instance, by maximizing
the information gain.
3. Repeat the steps 1 to 2 k times.
4. Aggregate the prediction by each tree to assign the class label by majority
vote.

There is a slight modification in step 2 when we are training the individual
decision trees: instead of evaluating all features to determine the best split at
each node, we only consider a random subset of those.

41. What is cross validation? Why is it used?

The technique of cross validation (CV) is best explained by example using the
most common method, K-Fold CV. When we approach a machine learning
problem, we make sure to split our data into a training and a testing set.

In K-Fold CV, we further split our training set into K number of subsets, called
folds. We then iteratively fit the model K times, each time training the data on
K-1 of the folds and evaluating on the Kth fold (called the validation data).

As an example, consider fitting a model with K = 5. The first iteration we train
on the first four folds and evaluate on the fifth. The second time we train on
the first, second, third, and fifth fold and evaluate on the fourth. We repeat this
procedure 3 more times, each time evaluating on a different fold. At the very
end of training, we average the performance on each of the folds to come up
with final validation metrics for the model.

For hyperparameter tuning, we perform many iterations of the entire K-Fold CV
process, each time using different model settings. We then compare all of the
models, select the best one, train it on the full training set, and then evaluate
on the testing set. If we have 10 sets of hyperparameters and are using 5-Fold
CV, that represents 50 training loops.

Fortunately, as with most problems in machine learning, someone has solved
our problem and model tuning with K-Fold CV can be automatically
implemented in Scikit-Learn.
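A minimal hedged sketch of both plain K-Fold CV and CV-based tuning in scikit-learn (X_train and y_train are assumed to come from an earlier train/test split; the grid values are illustrative):

from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(random_state=42)

# Plain 5-fold cross validation on the training set
scores = cross_val_score(model, X_train, y_train, cv=5)
print("Fold scores:", scores, "Mean:", scores.mean())

# Hyperparameter tuning with 5-fold CV: 10 settings x 5 folds = 50 training loops
param_grid = {"n_estimators": [100, 200], "max_depth": [3, 5, 7, 9, None]}
grid = GridSearchCV(model, param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best parameters:", grid.best_params_)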

42. What is OOB (Out Of Bag) Error in Random Forest?

Out-of-bag (OOB) error, also called out-of-bag estimate, is a method of
measuring the prediction error of random forests, boosted decision trees, and
other machine learning models utilizing bootstrap aggregating (bagging) to sub-
sample data samples used for training.

Subsampling allows one to define an out-of-bag estimate of the prediction
performance improvement by evaluating predictions on those observations
which were not used in the building of the next base learner.

In layman's terms: when you train each tree in a random forest, you do not use
all the samples. So for each bag, those unused samples can be used to find the
prediction error for that particular bag. The OOB error rate can then be
obtained by averaging the prediction errors from all the bags.
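In scikit-learn this can be requested directly (a small hedged sketch; X_train and y_train are assumed from an earlier split):

from sklearn.ensemble import RandomForestClassifier

# oob_score=True evaluates each tree on the samples left out of its bootstrap sample
rf = RandomForestClassifier(n_estimators=200, oob_score=True, bootstrap=True,
                            random_state=42)
rf.fit(X_train, y_train)

print("OOB accuracy:", rf.oob_score_)
print("OOB error   :", 1 - rf.oob_score_)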

43. What are the different hyperparameters involved in Random Forest tuning?

The different hyperparameters involved in Random Forest are:

1. max_depth: The max_depth of a tree in a Random Forest is defined as the
longest path between the root node and a leaf node. Using the max_depth
parameter, we can limit the depth to which every tree in the random forest
is allowed to grow.

2. min_samples_split: A parameter that tells the decision tree in a random
forest the minimum required number of observations in any given node in
order to split it.

3. max_leaf_nodes: This hyperparameter sets a condition on the splitting of
the nodes in the tree and hence restricts the growth of the tree. The
condition is based on the maximum number of leaf nodes allowed.

4. min_samples_leaf: This Random Forest hyperparameter specifies the
minimum number of samples that should be present in a leaf node after
splitting a node.

5. n_estimators: This is the number of trees you want to build before taking
the maximum voting or averaging of predictions. A higher number of trees
gives you better performance but makes your code slower.

6. max_samples (bootstrap sample): The max_samples hyperparameter
determines what fraction of the original dataset is given to any individual
tree.

7. max_features: This sets the maximum number of features provided to
each tree in a random forest.

44. Random Forest Code Snippet:
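The original snippet is not reproduced in this copy; the following is a minimal hedged sketch with a few of the hyperparameters from the previous answer (the file name "data.csv" and target column "target" are assumptions for illustration):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

df = pd.read_csv("data.csv")                      # hypothetical dataset
X, y = df.drop(columns=["target"]), df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rf = RandomForestClassifier(
    n_estimators=200,        # number of trees
    max_depth=10,            # limit tree depth
    min_samples_split=5,     # min observations required to split a node
    min_samples_leaf=2,      # min observations in a leaf after a split
    max_features="sqrt",     # random subset of features considered at each split
    random_state=42,
)
rf.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, rf.predict(X_test)))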


45. What is KNN Algorithm? How does it work?

The K Nearest Neighbors algorithm is a classification algorithm used in Data
Science and Machine Learning.

The goal is to classify a new data point/observation into one of multiple
existing categories. So, a number of neighbours 'k' is selected (usually k = 5),
and the k closest data points are identified (either using Euclidean or
Manhattan distances).

Of the k closest data points (or ‘neighbors’), the category with the highest
number of k-close neighbors is the category assigned to the new data point.

Intuitively, this makes sense - a data point belongs in the category it’s most
similar to with respect to its features/properties. The most similar data points
are the ones that are nearest to that data point, if you visualize it on a graph.

1. For a new data point, kNN will calculate its distance from every
single data point in our dataset. The most popular distance metric
used is the Euclidean distance.
2. Once every single distance is calculated, the algorithm will pick
the K nearest data points.
3. For classification problems, it will make a prediction based on the class
of those k-nearest data points. In this context, it is a good idea to
choose K as an odd number to avoid ties. The predicted class will be
the most frequently occurring class within the K data points. For regression,
it can predict based on the mean or median of those data points.

46. What is “K” in KNN Algorithm?

K = the number of nearest neighbours you want to select to predict the class
of a given item.

47. How to decide the value of K in the KNN Algorithm? Why is an odd value of K preferable?

Short Answer:

If K is small, then results might not be reliable because noise will have a
higher influence on the result. If K is large, then there will be a lot of
processing which may adversely impact the performance of the algorithm.
So, the following must be considered while choosing the value of K:

a. K should be the square root of n (the number of data points in the training
dataset)
b. K should be odd so that there are no ties. If square root is even, then
add or subtract 1 to it.

K should be odd so that there are no ties in the voting. If square root of
number of data points is even, then add or subtract 1 to it to make it odd.


Long Answer:

There is no straightforward method to calculate the value of K in KNN. You
have to play around with different values to choose the optimal value of
K. Choosing a right value of K is a process called Hyperparameter Tuning.

The value of the optimum K totally depends on the dataset that you are using.
The best value of K for KNN is highly data-dependent. In different
scenarios, the optimum K may vary. It is more or less a hit-and-trial method.

You need to maintain a balance while choosing the value of K in KNN. K
should not be too small or too large.

A small value of K means that noise will have a higher influence on the
result.

As K increases, the accuracy generally improves up to a point. But if K is too
large, you are under-fitting your model, and the error will go up again. So, at
the same time, you also need to prevent your model from under-fitting. Your
model should retain generalization capabilities, otherwise there are fair
chances that your model may perform well on the training data but
drastically fail on real data. A larger K will also increase the
computational expense of the algorithm.

There is no one proper method of estimating the K value in KNN. No
method is the rule of thumb, but you should try considering the following
suggestions:

1. Square Root Method: Take square root of the number of samples in the
training dataset.

2. Cross Validation Method: We should also use cross validation to find out
the optimal value of K in KNN. Start with K=1, run cross validation (5 to 10
fold), measure the accuracy and keep repeating till the results become
consistent.

K = 1, 2, 3... As K increases, the error usually goes down, then stabilizes, and
then rises again. Pick the optimum K at the beginning of the stable zone.
This is also called the Elbow Method.

3. Domain Knowledge also plays a vital role while choosing the optimum
value of K.

4. K should be an odd number.

I would suggest trying a mix of all the above points to reach a
conclusion.

48. Why is KNN Algorithm called a Lazy Learner? Can we use KNN for large datasets?

When it gets the training data, it does not learn and make a model, it just
stores the data. It does not derive any discriminative function from the
training data. It uses the training data when it actually needs to do some
prediction. So, KNN does not immediately learn a model, but delays the
learning, that is why it is called lazy learner.

KNN works well with smaller dataset because it is a lazy learner. It needs
to store all the data and then makes decision only at run time. It needs to
calculate the distance of a given point with all other points. So if dataset is
large, there will be a lot of processing which may adversely impact the
performance of the algorithm.

KNN is also very sensitive to noise in the dataset. If the dataset is large,
there are chances of noise in the dataset which adversely affect the
performance of KNN algorithm. For each new data point, the kNN classifier
must:

1. Calculate the distances to all points in the training set and store
them
2. Sort the calculated distances
3. Store the K nearest points
4. Calculate the proportions of each class
5. Assign the class with the highest proportion

Obviously, this is a very taxing process, both in terms of time and space
complexity. For n training points with d features, the distance computation for a
single query point is an O(n·d) process, and the sorting
is an O(n log n) process, so each prediction costs roughly O(n·d + n log n);
repeated for every query point, this becomes a monstrously long process indeed.
Another problem is memory, since all pairwise distances must be stored
and sorted in memory on a machine. With very large datasets, local
machines will usually crash.

49. Discuss the advantages and disadvantages of the KNN Algorithm.

Advantages of KNN

1. No Training Period: KNN is called Lazy Learner (Instance based learning).
It does not learn anything in the training period. It does not derive any
discriminative function from the training data. In other words, there is no
training period for it. It stores the training dataset and learns from it only
at the time of making real time predictions. This makes the KNN algorithm
much faster than other algorithms that require training e.g. SVM, Linear
Regression etc.

2. Since the KNN algorithm requires no training before making
predictions, new data can be added seamlessly which will not impact the
accuracy of the algorithm.

3. KNN is very easy to implement. There are only two parameters required
to implement KNN i.e. the value of K and the distance function (e.g.
Euclidean or Manhattan etc.)

Disadvantages of KNN

1. Does not work well with large dataset: In large datasets, the cost of
calculating the distance between the new point and each existing points is
huge which degrades the performance of the algorithm.

2. Does not work well with high dimensions: The KNN algorithm doesn't
work well with high dimensional data because with large number of
dimensions, it becomes difficult for the algorithm to calculate the distance
in each dimension.

3. Need feature scaling: We need to do feature scaling (standardization
and normalization) before applying KNN algorithm to any dataset. If we
don't do so, KNN may generate wrong predictions.

4. Sensitive to noisy data, missing values and outliers: KNN is sensitive to
noise in the dataset. We need to manually impute missing values and
remove outliers.

50. What are the Euclidean, Manhattan and Chebyshev distances?

The Euclidean distance or Euclidean metric is the "ordinary" (i.e. straight-line)
distance between two points in Euclidean space.

The Manhattan distance, also known as rectilinear distance, city block
distance, or taxicab metric, is defined as the sum of the lengths of the projections
of the line segment between the points onto the coordinate axes.
In chess, the distance between squares on the chessboard for rooks is
measured in Manhattan distance.

The Chebyshev distance between two vectors or points p and q, with
coordinates (p1, p2, ..., pn) and (q1, q2, ..., qn) respectively, is:

D(p, q) = max over i of |pi - qi|
It is also known as chessboard distance, since in the game of chess the
minimum number of moves needed by a king to go from one square on a
chessboard to another equals the Chebyshev distance between the centres of
the squares.
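A small sketch of the three distances in Python (the example points are arbitrary):

import numpy as np

def euclidean(p, q):
    return np.sqrt(np.sum((np.asarray(p) - np.asarray(q)) ** 2))

def manhattan(p, q):
    return np.sum(np.abs(np.asarray(p) - np.asarray(q)))

def chebyshev(p, q):
    return np.max(np.abs(np.asarray(p) - np.asarray(q)))

p, q = (1, 2), (4, 6)
print(euclidean(p, q))   # 5.0
print(manhattan(p, q))   # 7
print(chebyshev(p, q))   # 4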

51. How to handle categorical variables in KNN?

Create dummy variables out of a categorical variable and include them instead
of original categorical variable. Unlike regression, create k dummies instead of
(k-1).

For example, a categorical variable named "Department" has 5 unique levels /
categories, so we will create 5 dummy variables. Each dummy variable has a 1
against its department and a 0 otherwise.
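In pandas this is typically done with get_dummies (a small illustrative sketch; the column values are made up):

import pandas as pd

df = pd.DataFrame({"Department": ["HR", "IT", "Sales", "IT", "Finance"]})
# One dummy column per level (k dummies, not k-1), as described above
dummies = pd.get_dummies(df["Department"], prefix="Department", drop_first=False)
print(dummies)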

52. Can KNN be used for Regression? How to use KNN for Regression?

Yes, K-nearest neighbour can be used for regression. In other words, K-nearest
neighbour algorithm can be applied when dependent variable is continuous. In
this case, the predicted value is the average of the values of its k nearest
neighbours.
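In scikit-learn this corresponds to KNeighborsRegressor (a brief sketch; the training arrays are assumed to exist and y_train is continuous):

from sklearn.neighbors import KNeighborsRegressor

# The predicted value is the mean of the target values of the k nearest neighbours
knn_reg = KNeighborsRegressor(n_neighbors=5)
knn_reg.fit(X_train, y_train)
y_pred = knn_reg.predict(X_test)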

53. Discuss the difference between KNN and K Means Algorithms.

KNN and k-means clustering both are very different algorithms that solve
different problems and have their own meanings of what the variable ‘k’
is. KNN is a supervised classification algorithm that will label new data points
based on the ‘k’ number of nearest data points and k-means clustering is an
unsupervised clustering algorithm that groups the data into ‘k’ number of
clusters.

54. How to reduce the increased variance of the model other than by changing k?

By using a bagging-based decision boundary. If we are not restricted in the
number of times we can draw samples from the original dataset, a simple
variance reduction method would be to sample many times, and then simply
take a majority vote of the kNN models fit to each of these samples to classify
each test data point. This variance reduction method is called bagging.

55. What is the effect of sampling on KNN?

Sampling does several things in the perspective of a single data point, since
kNN works on a point-by-point basis.
1. The average distance to the k nearest neighbours increases due to
increased sparsity in the dataset.

2. Consequently, the area covered by k-nearest neighbours increases in
size and covers a larger area of the feature space.
3. The sample variance increases.
A consequence to this change in input is an increase in variance. When we talk
of variance, we refer to the variability in the predictions given different
samples from the population. Why would the immediate effects of sampling
lead to increased variance of the model?
Notice that now a larger area of the feature space is represented by the same k
data points. While our sample size has not grown, the population space that it
represents has increased in size. This will result in higher variance in the
proportion of classes in the k nearest data points, and consequently a higher
variance in the classification of each data point.

56. What happens when we change the value of K in KNN?

Short Answer: The class boundaries of the predictions become more smooth
as k increases.

Long Answer: What really is the significance of these effects? First, it gives
hints that a lower k value makes the kNN model more “sensitive.” That is, it is
more sensitive to the local changes in the dataset. The “sensitivity” of the
model directly translates to its variance.

All of these examples point to an inverse relationship between variance and k. Additionally, consider how kNN operates when k reaches its maximum value, k = n (where n is the number of points in the training set). In this case, the majority class in the training set will always dominate the predictions. The model will simply pick the most abundant class in the data and never deviate, effectively resulting in zero variance. Therefore, it seems that to reduce variance, k must be increased.

Final Verdict: In order to offset the increased variance due to sampling, k can
be increased to decrease model variance.

57. What is the thumb rule to approach KNN problem?


1. Load the data
2. Initialise the value of k

3. For getting the predicted class, iterate from 1 to total number of training
data points
1. Calculate the distance between test data and each row of training
data. Here we will use Euclidean distance as our distance metric since
it’s the most popular method. The other metrics that can be used are
Chebyshev, cosine, etc.
2. Sort the calculated distances in ascending order based on distance
values
3. Get top k rows from the sorted array
4. Get the most frequent class of these rows
5. Return the predicted class

58. KNN Code Snippet:
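A minimal scikit-learn sketch of a KNN classifier (the Iris dataset and k=5 are assumed purely for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scale features first, since KNN is distance-based
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, knn.predict(X_test)))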

59. What is SVM Algorithm?
SVM stands for Support Vector Machine. It is a supervised machine learning algorithm which can be used for both Regression and Classification.
In this algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features you have), with the value of each feature being the value of a particular coordinate.
For example, if we only had two features like Height and Hair length of an individual, we'd first plot these two variables in two-dimensional space where each point has two co-ordinates. (The points that end up lying closest to the separating line are known as Support Vectors.)

Now, we will find some line that splits the data between the two differently classified groups. This will be the line for which the distance from the closest point in each of the two groups is as large as possible.

In such a plot, the line which splits the data into the two differently classified groups is the one from which the two closest points are farthest away. This line is our classifier. Then, depending on which side of the line the testing data lands, that is the class we assign to the new data.

60. What are support Vectors?

A support vector machine attempts to find the line that "best" separates two
classes of points. By "best", we mean the line that results in the largest
margin between the two classes. The points that lie on this margin are
the support vectors.
The vectors that define the hyperplane are the support vectors.

61. What is the purpose of the Support Vector in SVM?


A Support Vector Machine (SVM) performs classification by finding the
hyperplane that maximizes the distance margin between the two classes.
The extreme points in the data sets that define the hyperplane are the support
vectors

62. What are kernels?

SVM algorithms use a set of mathematical functions that are defined as the kernel. The function of the kernel is to take data as input and transform it into the required form. Different SVM algorithms use different types of kernel functions.

There are four types of kernels in SVM.

1. Linear Kernel
2. Polynomial kernel
3. Radial basis kernel
4. Sigmoid kernel

63. What is Kernel Trick?


Short Answer: It allows us to operate in the original feature space without
computing the coordinates of the data in a higher dimensional space.

Long Answer:

1. For a dataset with n features (~n-dimensional), SVMs find an n-1-


dimensional hyperplane to separate it (let us say for classification)
2. Thus, SVMs perform very badly with datasets that are not linearly
separable
3. But, quite often, it’s possible to transform our not-linearly-separable
dataset into a higher-dimensional dataset where it becomes linearly
separable, so that SVMs can do a good job
4. Unfortunately, quite often, the number of dimensions you have to add
(via transformations) depends on the number of dimensions you
already have (and not linearly)
a. For datasets with a lot of features, it becomes next to impossible
to try out all the interesting transformations
5. Enter the Kernel Trick
 Thankfully, the only thing SVMs need to do in the (higher-
dimensional) feature space (while training) is computing
the pair-wise dot products
 For a given pair of vectors (in a lower-dimensional feature
space) and a transformation into a higher-dimensional
space, there exists a function (The Kernel Function) which
can compute the dot product in the higher-dimensional
space without explicitly transforming the vectors into the
higher-dimensional space first
 We are saved!
6. SVM can now do well with datasets that are not linearly separable
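
A small numeric sketch of this idea, using an assumed polynomial kernel K(x, z) = (x·z)^2 whose value equals a dot product in a higher-dimensional feature space, without ever computing that space explicitly:

import numpy as np

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

# Explicit mapping to a higher-dimensional space for the kernel (x.z)^2:
# phi([a, b]) = [a^2, b^2, sqrt(2)*a*b]
def phi(v):
    return np.array([v[0] ** 2, v[1] ** 2, np.sqrt(2) * v[0] * v[1]])

explicit = np.dot(phi(x), phi(z))   # dot product after transforming
kernel = np.dot(x, z) ** 2          # kernel computed in the original space
print(explicit, kernel)             # both give 121.0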

64. Why is SVM called as Large Margin Classifier?

Short Answer: Because it places the decision boundary such that it maximizes
the distance between two clusters.

Long Answer: Choosing the best hyperplane means choosing the one for which the distance from the closest training points is maximum. This is formalized by the geometric margin. Without getting into the details of the derivation, the geometric margin is given by:

gamma = y (w · x + b) / ||w||

which is simply the functional margin normalized by ||w||. These intuitions lead to the maximum margin classifier, which is a precursor to the SVM.

65. What is the difference between Logistics


Regression and SVM? When to use which model?

1. SVM tries to find the “best” margin (distance between the line and the
support vectors) that separates the classes and this reduces the risk of
error on the data, while logistic regression does not, instead it can have
different decision boundaries with different weights that are near the
optimal point.

2. SVM works well with unstructured and semi-structured data like text
and images while logistic regression works with already identified
independent variables.

3. SVM is based on the geometrical properties of the data while logistic


regression is based on statistical approaches.

4. Logistic Regression can’t be applied to nonlinearly separable dataset


whereas SVM can be applied.

5. The risk of overfitting is less in SVM, while Logistic regression is


vulnerable to overfitting.

When to Use Logistic Regression vs Support Vector Machine?

Depending on the number of training sets (data)/features that you have, you
can choose to use either logistic regression or support vector machine.

Let’s take these as an example where:

n = number of features,
m = number of training examples

1. If n is large (1–10,000) and m is small (10–1,000): use logistic regression or SVM with a linear kernel.
2. If n is small (1–1,000) and m is intermediate (10–10,000): use SVM with a (Gaussian, polynomial, etc.) kernel.
3. If n is small (1–1,000) and m is large (50,000–1,000,000+): first manually add more features and then use logistic regression or SVM with a linear kernel.

66. What does c and gamma parameter in SVM


signify?

Short Answer:
Cost and Gamma are the hyper-parameters that decide the performance of an
SVM model. There should be a fine balance between Variance and Bias for any
ML model. (this is a science and an art - as we call it in empirical studies)
For an SVM, a high value of Gamma fits the training data very closely (low bias but high variance), and vice-versa. Similarly, a large value of the Cost parameter (C) forces the optimizer to classify more training points correctly at the expense of a smaller margin (again low bias but high variance), and vice-versa.
The art is to choose a model with optimum variance and bias. Therefore, you need to choose the values of C and Gamma accordingly.
Optimum values of C and Gamma can be found by using methods like grid search (see the sketch after the long answer below).

Long Answer:

The C parameter tells the SVM optimization how much you want to avoid
misclassifying each training example. For large values of C, the optimization
will choose a smaller-margin hyperplane if that hyperplane does a better job of
getting all the training points classified correctly. Conversely, a very small value
of C will cause the optimizer to look for a larger margin separating hyperplane,
even if that hyperplane misclassifies more points. For very tiny values of C, you

should get misclassified examples, often even if your training data is linearly
separable.

The gamma parameter defines how far the influence of a single training
example reaches, with low values meaning ‘far’ and high values meaning
‘close’.

The gamma parameters can be seen as the inverse of the radius of influence of
samples selected by the model as support vectors. If gamma is too large, the
radius of the area of influence of the support vectors only includes the support
vector itself and no amount of regularization with C will be able to prevent
overfitting.

When gamma is very small, the model is too constrained and cannot capture
the complexity or “shape” of the data. The region of influence of any selected
support vector would include the whole training set. The resulting model will
behave similarly to a linear model with a set of hyperplanes that separate the
centres of high density of any pair of two classes.
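
A minimal sketch of tuning C and gamma with a grid search (the parameter grid and synthetic data are assumptions for illustration):

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Assumed grid of candidate values; real grids depend on the problem
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}
grid = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
grid.fit(X, y)
print("Best parameters:", grid.best_params_)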

67. What are Advantages and Disadvantages of SVM?

SVM Advantages
 SVMs are very good when we have little prior knowledge about the data.
 Works well even with unstructured and semi-structured data like text, images and trees.
 The kernel trick is the real strength of SVM. With an appropriate kernel function, we can solve many complex problems.
 Unlike neural networks, SVM is not prone to getting stuck in local optima, since its objective is convex.
 It scales relatively well to high-dimensional data.
 SVM models generalize well in practice; the risk of over-fitting is lower in SVM.
 SVM is often compared with ANN. Compared to ANN models, SVMs frequently give competitive or better results, especially on smaller datasets.

SVM Disadvantages
 Choosing a “good” kernel function is not easy.
 Long training time for large datasets.
 It is difficult to understand and interpret the final model, the variable weights and their individual impact.
 Since the final model is not easy to inspect, we cannot make small calibrations to it, so it is tough to incorporate our business logic.
 The SVM hyper-parameters are Cost (C) and gamma. It is not that easy to fine-tune these hyper-parameters, and it is hard to visualize their impact.

68. SVM Code Snippet:
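A minimal scikit-learn SVM sketch (the breast cancer dataset and an RBF kernel are assumed for illustration):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import classification_report

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scaling matters for SVMs since they are margin/distance based
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

svm = SVC(kernel="rbf", C=1.0, gamma="scale")
svm.fit(X_train, y_train)
print(classification_report(y_test, svm.predict(X_test)))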


…..
Naïve Bayes

69. What is Naïve Bayes Algorithm?

It is a classification algorithm that predicts the probability of each data point


belonging to a class and then classifies the point as the class with the highest
probability.

70. Discuss Bayes Theorem.

Bayes’ Theorem gives us the probability of an event actually happening by


combining the conditional probability given some result and the prior
knowledge of an event happening.

Conditional probability is the probability that something will happen, given that something else has occurred. In other words, the conditional probability is the probability of X given a test result, or P(X|Test). For example, what is the probability that an e-mail is spam given that my spam filter classified it as spam?

The prior probability is based on previous experience or the percentage of


previous samples. For example, what is the probability that any email is spam.
Formally, Bayes’ Theorem states that P(A|B) = [P(B|A) × P(A)] / P(B), where:

 P(A|B) = Posterior probability = Probability of A given B happened


 P(B|A) = Conditional probability = Probability of B happening if A is true
 P(A) = Prior probability = Probability of A happening in general
 P(B) = Evidence probability = Probability of getting a positive test
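
A tiny worked example of the theorem with assumed numbers (suppose 1% of emails are spam, and the filter flags 95% of spam and 2% of non-spam):

# Assumed numbers, purely for illustration
p_spam = 0.01
p_flag_given_spam = 0.95
p_flag_given_ham = 0.02

# Evidence: total probability that an email gets flagged
p_flag = p_flag_given_spam * p_spam + p_flag_given_ham * (1 - p_spam)

# Bayes' Theorem: P(spam | flag) = P(flag | spam) * P(spam) / P(flag)
p_spam_given_flag = p_flag_given_spam * p_spam / p_flag
print(round(p_spam_given_flag, 3))   # ~0.324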

71. Why is Naïve Bayes Naïve?

In Layman's Terms: The simple meaning of naive is being willing to believe that life is simple and fair, which is not true. Naive Bayes is naive because it assumes that the features going into the model are not related to each other in any way; a change in one variable will not directly affect the other variables.

Long Answer: Naive Bayes (NB) is ‘naive’ because it makes the assumption that features of a measurement are independent of each other. This is naive because it is (almost) never true. Here is how it works even then: NB is a very intuitive classification algorithm. It asks the question, “Given these features,
does this measurement belong to class A or B?”, and answers it by taking the
proportion of all previous measurements with the same features belonging to
class A multiplied by the proportion of all measurements in class A. If this
number is bigger than the corresponding calculation for class B then we say
the measurement belongs in class A.

72. What are feature matrix and response vectors?

 Feature matrix contains all the vectors(rows) of dataset in which each
vector consists of the value of dependent features.
 Response vector contains the value of class variable (prediction or
output) for each row of feature matrix.

73. Applications of Naïve Bayes Classification


Algorithms?

Some of real-world examples are as given below


 To mark an email as spam, or not spam?
 Classify a news article about technology, politics, or sports?
 Check a piece of text expressing positive emotions,
or negative emotions?
 Also used for face recognition software.

74. What are the Advantages and Disadvantages of


using Naïve Bayes Algorithm?

ADVANTAGES

1. Fast
2. Highly scalable.
3. Used for binary and Multi class Classification.
4. Great Choice for text classification.
5. Can easily train smaller data sets.

DISADVANTAGES

Naive Bayes considers that the features are independent of each other.
However, in real world, features depend on each other.

75. Naïve Bayes Code Snippet:
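A minimal scikit-learn Naive Bayes sketch (Gaussian NB on the Iris dataset, assumed for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# GaussianNB assumes features are conditionally independent and Gaussian per class
nb = GaussianNB()
nb.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, nb.predict(X_test)))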


76. What is K-Means Clustering? What are the steps


for it?

K-means (Macqueen, 1967) is one of the simplest unsupervised learning


algorithms that solve the well-known clustering problem. K-means clustering is
a method of vector quantization, originally from signal processing, that is
popular for cluster analysis in data mining.

If k is given, the K-means algorithm can be executed in the following steps:


 Partition the objects into k non-empty subsets.
 Identify the cluster centroids (mean points) of the current partition.
 Compute the distance from each point to every centroid and allot each point to the cluster whose centroid is nearest.
 After re-allotting the points, find the centroids of the newly formed clusters.
 Repeat the last two steps until the cluster assignments no longer change.

77. Why is the word “means” associated with the


name of K-Means algorithm?

The ‘means’ in the K-means refers to averaging of the data; that is, finding the
centroid.

There are k-medoids and k-medians algorithms as well.

k-medoids minimizes the sum of dissimilarities between points labelled


to be in a cluster and a point designated as the centre of that cluster. In
contrast to the k-means algorithm, k-medoids chooses datapoints as
centres (medoids or exemplars).

k-medians is a variation of k-means clustering where instead of


calculating the mean for each cluster to determine its centroid, one
instead calculates the median.

78. How to find the optimum number of clusters in K-


Means? Discuss the elbow curve/elbow method?

Basic idea behind partitioning methods, such as k-means clustering, is to define


clusters such that the total intra-cluster variation [or total within-cluster sum
of square (WSS)] is minimized. The total WSS measures the compactness of the
clustering and we want it to be as small as possible.

The Elbow method looks at the total WSS as a function of the number of clusters: one should choose a number of clusters such that adding another cluster does not improve the total WSS by much.


In the resulting plot of total WSS versus k, notice the elbow at k = 3.

The optimal number of clusters can be defined as follow:

1. Compute clustering algorithm (e.g., k-means clustering) for different


values of k. For instance, by varying k from 1 to 10 clusters.
2. For each k, calculate the total within-cluster sum of square (WSS).
3. Plot the curve of WSS according to the number of clusters k.
4. The location of a bend (knee) in the plot is generally considered as an
indicator of the appropriate number of clusters.
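
A minimal sketch of the elbow method described above (synthetic blobs are assumed for illustration):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, random_state=42)

# inertia_ is the total within-cluster sum of squares (WSS) for each k
wss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wss.append(km.inertia_)

plt.plot(range(1, 11), wss, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("Total within-cluster sum of squares")
plt.show()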

79. What is the difference between K-Means and


Hierarchical Clustering? When to use which?

Hierarchical Clustering and k-means clustering complement each other. In


hierarchical clustering, the researcher does not need to know the number of clusters in advance, whereas in k-means clustering the number of clusters must be specified beforehand.
Advice: If you are unaware of the number of clusters to be formed, use hierarchical clustering to determine the number, and then use k-means clustering to make more stable clusters, since hierarchical clustering is a single-pass exercise whereas k-means is an iterative process.

80. What are the advantages and disadvantages of


using K-Means Algorithms?

K-Means Advantages:

1) If the number of variables is huge, K-Means is most of the time computationally faster than hierarchical clustering, provided we keep k small.

2) K-Means produces tighter clusters than hierarchical clustering, especially if the clusters are globular.

K-Means Disadvantages:

1) It is difficult to predict the value of K.
2) It does not work well when the clusters are not globular or well separated.
3) Different initial partitions can result in different final clusters.
4) It does not work well with clusters (in the original data) of different size and different density.


81. K-Means Code Snippet:
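
A minimal scikit-learn K-Means sketch (synthetic blobs and k=3 are assumed for illustration):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print("Cluster centroids:\n", kmeans.cluster_centers_)
print("First 10 labels:", labels[:10])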

82. What is Hierarchical Clustering?


Hierarchical clustering is another unsupervised learning algorithm that is used
to group together the unlabelled data points having similar characteristics.
Hierarchical clustering algorithms fall into the following two categories.
Agglomerative hierarchical algorithms − In agglomerative hierarchical algorithms, each data point is treated as a single cluster, and pairs of clusters are then successively merged or agglomerated (bottom-up approach). The hierarchy of the clusters is represented as a dendrogram or tree structure.
Divisive hierarchical algorithms − On the other hand, in divisive hierarchical
algorithms, all the data points are treated as one big cluster and the process of
clustering involves dividing (Top-down approach) the one big cluster into
various small clusters.

83. What are the steps to perform Agglomerative


Hierarchical Clustering?
The most used and important type of hierarchical clustering is agglomerative. The steps to perform it are as follows −
 Step 1 − Treat each data point as a single cluster. Hence, we will have, say, K clusters at the start. The number of data points will also be K at the start.
 Step 2 − Now, in this step we need to form a bigger cluster by joining the two closest data points. This will result in a total of K-1 clusters.
 Step 3 − Now, to form more clusters we need to join the two closest clusters. This will result in a total of K-2 clusters.
 Step 4 − Repeat the above step until only one big cluster remains, i.e., there are no more clusters left to join.
 Step 5 − At last, after making one single big cluster, the dendrogram will be used to divide it into multiple clusters depending upon the problem.

84. What is a Dendrogram and what is its importance in


Hierarchical Clustering?
A dendrogram is a type of Tree Diagram showing hierarchical clustering —
relationships between similar sets of data. They are frequently used in biology
to show clustering between genes or samples, but they can represent any type
of grouped data.
The role of the dendrogram starts once the big cluster is formed. The dendrogram is used to split the cluster into multiple clusters of related data points depending upon our problem.

Parts of a Dendrogram: the main parts are the clades (branches), the leaves (individual data points), and the heights at which clades merge, which indicate how dissimilar the merged clusters are.


85. Hierarchical Clustering Code Snippet:
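
A minimal sketch of agglomerative clustering with a dendrogram (synthetic blobs and Ward linkage are assumed for illustration):

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=42)

# Dendrogram built from the linkage matrix (Ward's method)
plt.figure()
dendrogram(linkage(X, method="ward"))
plt.show()

# Flat clustering with a chosen number of clusters
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = agg.fit_predict(X)
print(labels[:10])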

…..

Gradient Boosting Algorithms

a. Adaboost
b. GBM
c. XGBoost

86. What is Boosting?

Boosting is a method of converting weak learners into strong learners. In


boosting, each new tree is a fit on a modified version of the original data set.

Purpose of Boosting: It helps the weak learner to be modified to become


better.

How it evolved: The first boosting algorithm to gain popularity was AdaBoost, or Adaptive Boosting. It was later generalised as Gradient Boosting.

87. What is Adaboost?

Adaboost combines multiple weak learners into a single strong learner. The
weak learners in AdaBoost are decision trees with a single split, called decision
stumps. When AdaBoost creates its first decision stump, all observations are
weighted equally. To correct the previous error, the observations that were
incorrectly classified now carry more weight than the observations that were
correctly classified. AdaBoost algorithms can be used for both classification and regression problems.

88. Adaboost Code Snippet:
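
A minimal scikit-learn AdaBoost sketch (decision stumps as weak learners; the synthetic data and parameter values are assumed for illustration):

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Decision stumps (max_depth=1) are the classic weak learners for AdaBoost
ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                         n_estimators=100, learning_rate=0.5, random_state=1)
ada.fit(X_train, y_train)
print("Test accuracy:", ada.score(X_test, y_test))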


89. What is Gradient Boosting Method (GBM)?

Gradient Boosting works by sequentially adding predictors to an ensemble,


each one correcting its predecessor. However, instead of changing the weights
for every incorrect classified observation at every iteration like AdaBoost,
Gradient Boosting method tries to fit the new predictor to the residual errors
made by the previous predictor.

GBM uses Gradient Descent to find the shortcomings in the previous learner’s predictions. The GBM algorithm can be given by the following steps:
1. Fit a model to the data: F1(x) = y
2. Fit a new model to the residuals: h1(x) = y - F1(x)
3. Create an improved model: F2(x) = F1(x) + h1(x)
By combining weak learner after weak learner, our final model is able to account for a lot of the error of the original model and reduces this error over time.

90. Gradient Boosting Code Snippet:
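
A minimal scikit-learn Gradient Boosting sketch (synthetic data and parameter values are assumed for illustration):

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

# Each new tree is fit to the residual errors of the current ensemble
gbm = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05,
                                 max_depth=3, random_state=7)
gbm.fit(X_train, y_train)
print("Test accuracy:", gbm.score(X_test, y_test))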


91. What is XGBoost?

XGBoost stands for eXtreme Gradient Boosting. XGBoost is an


implementation of gradient boosted decision trees designed for speed and
performance. Gradient boosting machines are generally very slow in
implementation because of sequential model training. Hence, they are not
very scalable. Thus, XGBoost is focused on computational speed and model
performance. XGBoost provides:

 Parallelization of tree construction using all of your CPU cores during


training.
 Distributed Computing for training very large models using a cluster of
machines.
 Out-of-Core Computing for very large datasets that don’t fit into
memory.
 Cache Optimization of data structures and algorithm to make the best
use of hardware.


92. XGBoost Code Snippet:
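
A minimal XGBoost sketch (assumes the xgboost package is installed; the synthetic data and parameter values are for illustration only):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3)

xgb = XGBClassifier(n_estimators=200, learning_rate=0.05,
                    max_depth=3, subsample=0.8)
xgb.fit(X_train, y_train)
print("Test accuracy:", xgb.score(X_test, y_test))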

93. What are basic enhancements done to Gradient


Boosting?
Gradient boosting is a greedy algorithm and can overfit a training dataset
quickly. It can benefit from regularization methods that penalize various parts
of the algorithm and generally improve the performance of the algorithm by
reducing overfitting.

We will look at 4 enhancements to basic gradient boosting:


1. Tree Constraints
2. Shrinkage
3. Random sampling
4. Penalized Learning
1. Tree Constraints: A good general heuristic is that the more constrained the tree creation is, the more trees you will need in the model; conversely, with less constrained individual trees, fewer trees will be required.

Below are some constraints that can be imposed on the construction of decision trees:
 Number of trees: generally, adding more trees to the model can be very slow to overfit. The advice is to keep adding trees until no further improvement is observed.
 Tree depth: deeper trees are more complex trees and shorter trees are preferred. Generally, better results are seen with 4-8 levels.
 Number of nodes or number of leaves: like depth, this can constrain the size of the tree, but it is not constrained to a symmetrical structure if other constraints are used.
 Number of observations per split: imposes a minimum constraint on the amount of training data at a training node before a split can be considered.
 Minimum improvement to loss: a constraint on the improvement of any split added to a tree.

2. Weighted Updates (Shrinkage): The predictions of each tree are added together


sequentially. The contribution of each tree to this sum can be weighted to slow
down the learning by the algorithm. This weighting is called a shrinkage or a
learning rate.

3. Stochastic Gradient Boosting

A big insight into bagging ensembles and random forest was allowing trees to be greedily created from subsamples of the training dataset. This same benefit can be used to reduce the correlation between the trees in the sequence in gradient boosting models. This variation of boosting is called stochastic gradient boosting.

At each iteration a subsample of the training data is drawn at random (without


replacement) from the full training dataset. The randomly selected subsample
is then used, instead of the full sample, to fit the base learner.

4. Penalized Gradient Boosting

Additional constraints can be imposed on the parameterized trees in addition


to their structure.

Classical decision trees like CART are not used as weak learners; instead, a modified form called a regression tree is used that has numeric values in the leaf nodes (also called terminal nodes). The values in the leaves of the trees can be called weights in some literature. As such, the leaf weight values of the trees can be regularized using popular regularization functions, such as:


L1 regularization of weights.
L2 regularization of weights.

The additional regularization term helps to smooth the final learnt weights to
avoid over-fitting. Intuitively, the regularized objective will tend to select a
model employing simple and predictive functions.
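
A hedged sketch of how these four enhancements map onto common hyper-parameters, using XGBoost parameter names (the specific values are assumptions, not recommendations):

from xgboost import XGBClassifier

# Each argument below corresponds to one of the enhancements discussed above
model = XGBClassifier(
    max_depth=4,            # tree constraints: limit tree depth
    min_child_weight=5,     # tree constraints: minimum data needed in a leaf/split
    learning_rate=0.05,     # shrinkage: weight each tree's contribution
    subsample=0.8,          # stochastic gradient boosting: row subsampling per tree
    colsample_bytree=0.8,   # stochastic gradient boosting: feature subsampling
    reg_alpha=0.1,          # penalized learning: L1 regularization of leaf weights
    reg_lambda=1.0,         # penalized learning: L2 regularization of leaf weights
    n_estimators=300,       # number of trees in the ensemble
)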

…..
Dimensionality Reduction Algorithms/Techniques

94. What is Dimensionality Reduction? Why is it used?


Dimensionality reduction refers to the process of converting a dataset with vast dimensions into data with fewer dimensions, while ensuring that it still conveys similar information concisely.

We use these techniques to solve machine learning problems, where the goal is to obtain better features for a classification or regression task.

95. What are the commonly used Dimensionality


Reduction Techniques?
The various methods used for dimensionality reduction include:
 Principal Component Analysis (PCA)
 Linear Discriminant Analysis (LDA)
 Generalized Discriminant Analysis (GDA)

96. How does PCA work? When to use?

Short Answer: Principal Component Analysis (PCA) is an unsupervised, non-


parametric statistical technique primarily used for dimensionality reduction in
machine learning.

High dimensionality means that the dataset has a large number of features. The
primary problem associated with high dimensionality in the machine learning

field is model overfitting, which reduces the ability to generalize beyond the
examples in the training set.

PCA in Layman’s Term: Consider the 2D XY plane.

For the sake of intuition, let us consider variance as the spread of data -
distance between the two farthest points.

Assumption:
Typically, it is believed, that if the variance of data is large, it offers more
information, than data which has small variance. (This may or may not be
true). This is the assumption which PCA intends to exploit.

I give you 4 points - {(1,1), (2,2), (3,3), (4,4)}


(all lie on the line X=Y)

What is the variance on X-axis?


Variance(X) = 4-1 = 3

What is the variance on Y-axis?


Variance(Y) = 4-1 = 3

Can we obtain new data with higher variance in some manner?


Rotate your XY system by 45 degrees anticlockwise. What happens? The line
X=Y has now become the X(new)-axis. And, X = -Y is now the Y(new)-axis. Let's
compute the variance again (in the form of distance)

Variance(X(new)) = distance ((4,4), (1,1)) = sqrt(18) = 4.24


Variance(Y(new)) = 0, since all four points now lie on the X(new)-axis.

What did we get by doing this rotation?


Original data - had highest variance on any axis as 3. This rotation gave us a
variance of 4.24

That was the intuitive explanation of what PCA does. Just for further
clarification

Eigenvalues = variance of the data along a particular axis in the new coordinate
system. In above example, Eigenvalue(X(new)) = 4.24.

Eigenvectors = the vectors which represent the new coordinate system. In
above example, vector [1,1], would be an eigenvector for X(new), and [1,-1]
eigenvector for Y(new). Since they are just directions - solvers typically give us
unit vectors.

Getting transformed data


Once you have the eigenvectors, a dot product of the eigenvector with the
original point will give you the new point in the new coordinate system.

Diagonalization: This is the part where you solve det(Covariance - lambda*I) = 0. It basically finds a set of eigenvectors (a new coordinate system) in which the covariance matrix becomes diagonal, so that only variance terms remain and the covariance terms are zero.

Steps of PCA:

1. Calculate the covariance matrix of the data points.
2. Calculate the eigenvectors and corresponding eigenvalues.
3. Sort the eigenvectors according to their eigenvalues in decreasing order.
4. Choose the first k eigenvectors; these will be the new k dimensions.
5. Transform the original n-dimensional data points into k dimensions.

97. PCA Code Snippet:
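
A minimal scikit-learn PCA sketch (the Iris data and 2 components are assumed for illustration):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# Standardize first, then project onto the top 2 principal components
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Shape after reduction:", X_pca.shape)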


98. How does LDA work? When to use?


LDA is a way to reduce 'dimensionality' while at the same time preserving as
much of the class discrimination information as possible.

How does it work?

Basically, LDA helps you find the 'boundaries' around clusters of classes. It
projects your data points on a line so that your clusters 'are as separated as
possible', with each cluster having a relative (close) distance to a centroid.
What was that stuff about dimensionality?

Let's say you have a group of data points in 2 dimensions, and you want to
group them into 2 groups. LDA reduces the dimensionality of your set like so:
K(Groups) = 2. 2-1 = 1.

Why? Because "The K centroids lie in an at most K-1-dimensional affine


subspace". What is the affine subspace? It’s a geometric concept or
*structure* that says, "I am going to generalize the affine properties of
Euclidean space". What are those affine properties of the Euclidean space?
Basically, it’s the fact that we can represent a point with 3 coordinates in a 3-
dimensional space (with a nod toward the fact that there may be more than 3
dimensions that we are ultimately dealing with).

So, we should be able to represent a point with 2 coordinates in 2 dimensional


space, and represent a point with 1 coordinate in a 1-dimensional space. LDA reduced the dimensionality of our 2-dimensional problem down to one dimension. So now we can get down to the serious business of listening to the data. We now have 2 groups, and 2 points (the centroids) in any dimension can be joined by a line. How many dimensions does a line have? 1! Now we are cooking with Crisco!
So we get a bunch of these data points, represented by their 2d representation
(x,y). We are going to use LDA to group these points into either group 1 or
group 2.

99. What are the Steps for LDA?


Steps of LDA:
1. Compute the d-dimensional mean vectors for the different classes from the dataset.
2. Compute the scatter matrices (the between-class and within-class scatter matrices).
3. Compute the eigenvectors and eigenvalues, sort the eigenvectors by decreasing eigenvalue, and choose the k eigenvectors with the largest eigenvalues to form a d x k dimensional matrix W (where every column represents an eigenvector).
4. Use this d x k eigenvector matrix to transform the samples onto the new subspace.
This can be summarized by the matrix multiplication Y = X x W, where X is an n x d dimensional matrix representing the n samples, and Y holds the transformed n x k dimensional samples in the new subspace.

100. LDA Code Snippet:
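
A minimal scikit-learn LDA sketch (the Iris data is assumed; with 3 classes, LDA can keep at most 2 components):

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# LDA uses the class labels, unlike PCA; n_components <= n_classes - 1
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)

print("Explained variance ratio:", lda.explained_variance_ratio_)
print("Shape after reduction:", X_lda.shape)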

101. What is GDA?

When we have a classification problem in which the input features are continuous random variables, we can use GDA. It is a generative learning algorithm in which we assume p(x|y) is distributed according to a multivariate normal distribution and p(y) is distributed according to a Bernoulli distribution.


Gaussian discriminant analysis (GDA) is a generative model for classification


where the distribution of each class is modeled as a multivariate Gaussian.

102. What are advantages and disadvantages of


Dimensionality Reduction?
Advantages:
 Dimensionality Reduction helps in data compression, and hence reduces the storage space required.
 It reduces computation time.
 It helps remove redundant features, if any. For example, there is no point in storing a value in two different units (meters and inches).
 Fewer dimensions mean less computation, and lower dimensionality also allows the use of algorithms that are unfit for a large number of dimensions.
 It takes care of multicollinearity, which improves model performance.
 Reducing the dimensions of data to 2D or 3D may allow us to plot and visualize it precisely. You can then observe patterns more clearly.

Disadvantages:

 It may lead to some amount of data loss.
 PCA tends to find only linear correlations between variables, which is sometimes undesirable.
 PCA fails in cases where mean and covariance are not enough to define the dataset.
 We may not know how many principal components to keep; in practice, some rules of thumb are applied.

…..
