Machine Learning Interview Question
There are Machine Learning use cases in almost every industry today. Broadly, Machine Learning techniques fall into four categories:
1. Supervised Learning
2. Unsupervised Learning
3. Semi-supervised Learning
4. Reinforcement Learning
A typical Machine Learning workflow involves the following steps:
1. Collect data
2. Filter data
3. Analyse data
4. Train algorithms
5. Test algorithms
6. Use algorithms for future predictions
Bias is the difference between the average prediction of our model and the
correct value which we are trying to predict. A model with high bias pays very
little attention to the training data and oversimplifies the problem. It always
leads to high error on both training and test data.
Variance is the variability of model prediction for a given data point, which
tells us the spread of our predictions. A model with high variance pays a lot of
attention to the training data and does not generalize to data it hasn't seen
before. As a result, such models perform very well on training data but have
high error rates on test data.
In supervised learning, underfitting happens when our model fails to capture the
underlying pattern of the data. Such models have high bias and low variance. It
happens when we have a very small amount of data to build an accurate model, or
when we try to fit a linear model to non-linear data. These kinds of models are
too simple to capture complex patterns in the data, e.g. linear and logistic regression.
In supervised learning, overfitting happens when our model captures the noise
along with the underlying pattern in the data. It happens when we train our model
extensively on a noisy dataset. These models have low bias and high variance.
Such models are often very complex, like decision trees, which are prone to overfitting.
For any supervised algorithm, having a high bias error usually means it has a low
variance error, and vice versa. To be more specific, parametric or linear ML
algorithms often have high bias but low variance. On the other hand, non-parametric
or non-linear algorithms often have low bias but high variance.
The goal of any ML model is to obtain a low-variance and low-bias state, which is
often a challenging task due to the parametrization of machine learning algorithms.
Common ways to achieve an optimum bias-variance trade-off are:
……………………………….
The case of one explanatory variable is called simple linear regression; for
more than one explanatory variable, the process is called multiple linear
regression.
Long Answer:
There are three main assumptions in a linear regression model:
2. Assumptions about the residuals:
(Explanation: because the observations with larger errors will have more
pull or influence on the fitted model.)
8. What is Regularization? Explain different types of
Regularizations.
Lambda is a shared penalization parameter while alpha sets the ratio between
L1 and L2 regularization in the Elastic Net Regularization. Hence, we expect a
hybrid behavior between L1 and L2 regularization.
If the value of λ is very large, the penalty term dominates and the model will tend
to underfit (high bias, low variance). On the other hand, if the value of λ is 0 or
very small, the model will tend to overfit the training data (low bias, high variance).
There is no proper way to select the value of λ. What you can do is have a sub-
sample of data and run the algorithm multiple times on different sets. Here,
the person has to decide how much variance can be tolerated. Once the user is
satisfied with the variance, that value of λ can be chosen for the full dataset.
One thing to be noted is that the value of λ selected here was optimal for that
subset, not for the entire training data.
(Note: Cost function is the average of the loss functions for all the training
examples.)
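As an illustration of these penalties (not from the original set), here is a minimal scikit-learn sketch on assumed synthetic data. Note that sklearn's alpha argument plays the role of the penalization parameter λ above, and l1_ratio plays the role of the L1/L2 mixing parameter:

# Minimal sketch (assumed synthetic data): comparing L1, L2 and Elastic Net penalties.
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.randn(200, 10)
y = X[:, 0] * 3.0 - X[:, 1] * 2.0 + rng.randn(200) * 0.5  # synthetic target

for lam in [0.01, 0.1, 1.0, 10.0]:
    for name, model in [("ridge", Ridge(alpha=lam)),
                        ("lasso", Lasso(alpha=lam)),
                        ("elastic", ElasticNet(alpha=lam, l1_ratio=0.5))]:
        score = cross_val_score(model, X, y, cv=5).mean()  # R^2 by default
        print(f"{name:8s} lambda={lam:<5} mean CV R^2={score:.3f}")

Running the sub-sampled search the answer describes amounts to repeating this comparison on different subsets and picking the λ whose variance you are willing to tolerate.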
When it is used: Gradient descent is best used when the parameters cannot be
calculated analytically (e.g. using linear algebra) and must be searched for by
an optimization algorithm.
Details: The goal of any machine learning model is to minimize the cost function.
To find the minimum of the cost function we use the Gradient Descent algorithm.
Without going too deep into the mathematics, these are the steps involved in
Gradient Descent.
Step 1: Initialize the coefficient of the function. The initial value should either
be 0 or a very small value.
coefficient = 0.0
Step 2: For the coefficient chosen in Step 1, calculate the cost by plugging it
into the function.
cost = f(coefficient)
Step 3: Find the derivative (delta) of the cost function at the current coefficient.
delta = derivative(cost)
Step 4: Now that we know from the derivative which direction is downhill, we can
update the coefficient value. A learning rate (alpha) must be specified that
controls how much the coefficient can change on each update:
coefficient = coefficient - (alpha * delta)
This process is repeated until the cost of the coefficient (cost) is 0.0, or close
enough to zero to be good enough, or until the number of epochs is reached.
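A minimal sketch of these steps in Python, assuming a made-up one-parameter cost function f(c) = (c - 3)^2 purely for illustration:

def f(coefficient):
    return (coefficient - 3.0) ** 2          # cost = f(coefficient)

def derivative(coefficient):
    return 2.0 * (coefficient - 3.0)         # delta = derivative(cost)

coefficient = 0.0                            # Step 1: initialize
alpha = 0.1                                  # learning rate
for epoch in range(100):                     # repeat until cost ~ 0 or epochs exhausted
    cost = f(coefficient)                    # Step 2: evaluate cost
    delta = derivative(coefficient)          # Step 3: derivative
    coefficient = coefficient - alpha * delta  # Step 4: update downhill
    if cost < 1e-8:
        break
print(coefficient)  # converges towards 3.0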
Selecting the value of learning rate is a tricky business. If the value is too small,
the gradient descent algorithm takes ages to converge to the optimal solution.
On the other hand, if the value of the learning rate is high, the gradient
descent will overshoot the optimal solution and most likely never converge to
the optimal solution.
To overcome this problem, you can try different values of alpha over a range and
plot the cost against the number of iterations for each. Based on these graphs,
choose the value whose curve shows a rapid, steady decrease in cost.
If you see that the cost is increasing with the number of iterations, your
learning rate is too high and needs to be decreased.
The Variance Inflation Factor (VIF) measures how much the variance of the estimated
regression coefficients is inflated compared to when the predictor variables are
not linearly related. To compute the VIF for a predictor X, we regress X on the
other predictors (say Y and Z): the higher the VIF, the higher that R-squared,
which means X is collinear with Y and Z. If all the variables are completely
orthogonal, the R-squared will be 0, resulting in a VIF of 1.
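A hedged sketch of computing VIF in practice with statsmodels, on assumed synthetic predictors (X, Y and a Z deliberately made collinear with X):

import pandas as pd
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

rng = np.random.RandomState(0)
df = pd.DataFrame({"X": rng.randn(100), "Y": rng.randn(100)})
df["Z"] = df["X"] * 0.9 + rng.randn(100) * 0.1      # Z is collinear with X

exog = add_constant(df)                             # VIF needs an intercept column
vif = pd.Series(
    [variance_inflation_factor(exog.values, i) for i in range(exog.shape[1])],
    index=exog.columns,
)
print(vif)   # X and Z should show VIF >> 1; orthogonal predictors stay near 1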
14. How do Residual plots and Q-Q plots help in a linear
regression model?
Residual plots and Q-Q plots are used to visually check that your data meet
the homoscedasticity and normality assumptions of linear regression.
Q-Q plots let you check the assumption of normality. They compare the
distribution of your data to a normal distribution by plotting the quantiles of
your data against the quantiles of a normal distribution. If your data are
normally distributed, the points should form an approximately straight line.
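A minimal sketch of both diagnostic plots, assuming synthetic data and a scikit-learn fit:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.rand(200, 1) * 10
y = 2.5 * X[:, 0] + rng.randn(200)            # assumed synthetic linear data

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(model.predict(X), residuals, s=10)     # residuals vs fitted values
ax1.axhline(0, color="red")
ax1.set_xlabel("Fitted values")
ax1.set_ylabel("Residuals")
ax1.set_title("Residual plot (check homoscedasticity)")

stats.probplot(residuals, dist="norm", plot=ax2)   # Q-Q plot against the normal
ax2.set_title("Q-Q plot (check normality)")
plt.show()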
Long Answer: R-squared or R2 explains the degree to which your input
variables explain the variation of your output / predicted variable. So, if R-
square is 0.8, it means 80% of the variation in the output variable is explained
by the input variables. So, in simple terms, the higher the R-squared, the more
variation is explained by your input variables, and hence the better your model.
However, the problem with R-squared is that it will either stay the same or
increase with addition of more variables, even if they do not have any
relationship with the output variables. This is where “Adjusted R square”
comes to help. Adjusted R-square penalizes you for adding variables which do
not improve your existing model. Adjusted R2 also explains the degree to
which your input variables explain the variation of your output / predicted
variable but adjusts for the number of terms in a model. If you add more and
more useless variables to a model, adjusted r-squared will decrease. If you add
more useful variables, adjusted r-squared will increase.
Adjusted R2 will always be less than or equal to R2. So, basically in adjusted R2
we include only the important and useful variables, whereas in R2 all the
variables are included.
Typically, the more non-significant variables you add to the model, the wider the
gap between R-squared and Adjusted R-squared becomes.
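A small illustration of this behaviour, using the standard formula Adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1) on assumed synthetic data:

import numpy as np
from sklearn.linear_model import LinearRegression

def adjusted_r2(r2, n, p):
    # p = number of predictors, n = number of observations
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

rng = np.random.RandomState(0)
n = 200
X_useful = rng.randn(n, 2)
y = X_useful @ np.array([2.0, -1.0]) + rng.randn(n)
X_noise = np.hstack([X_useful, rng.randn(n, 5)])    # 5 useless predictors added

for X in (X_useful, X_noise):
    r2 = LinearRegression().fit(X, y).score(X, y)
    # R^2 creeps up with useless predictors; adjusted R^2 usually does not
    print(X.shape[1], "predictors: R^2 =", round(r2, 4),
          "adjusted R^2 =", round(adjusted_r2(r2, n, X.shape[1]), 4))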
Forecast KPIs are used to evaluate forecast accuracy. The main KPIs are as
follows:
Bias: Bias represents the historical average error. Basically, will your forecasts
be on average too high (i.e. you overshot the demand) or too low (i.e.
you undershot the demand)? This will give you the overall direction of the error.
Mean Absolute Percentage Error (MAPE) is the sum of the individual absolute
errors, each divided by the demand of its period; in other words, it is the
average of the absolute percentage errors.
Issue with MAPE: MAPE divides each error individually by the demand, so it is
skewed: high errors during low-demand periods will have a major impact on
MAPE. Due to this, optimizing MAPE will result in a strange forecast that will
most likely undershoot the demand. Just avoid it.
The Mean Absolute Error (MAE) is a very good KPI to measure forecast
accuracy. As the name implies, it is the mean of the absolute error.
Issue with MAE: One of the first issues of this KPI is that it is not scaled to the
average demand. If one tells you that the MAE is 10 for a particular item, you
cannot know if this is good or bad. If your average demand is 1,000, an MAE of 10
is excellent, but if the average demand is 1, it is very poor accuracy.
To solve this, it is common to divide MAE by the average demand to get a
percentage (MAE%).
Issue with RMSE: Just as for MAE, RMSE is not scaled to the demand. We can
then define RMSE% by dividing RMSE by the average demand.
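A hedged numpy sketch of these KPIs on assumed demand and forecast arrays:

import numpy as np

demand   = np.array([100, 110,  90, 120, 105], dtype=float)   # assumed demand
forecast = np.array([102, 108,  95, 118, 100], dtype=float)   # assumed forecast
error = forecast - demand

bias  = error.mean()                              # average error: overall direction
mape  = np.mean(np.abs(error) / demand)           # mean absolute percentage error
mae   = np.abs(error).mean()                      # mean absolute error
rmse  = np.sqrt((error ** 2).mean())              # root mean squared error

# Scale MAE and RMSE by the average demand so they are comparable across items
mae_pct  = mae / demand.mean()
rmse_pct = rmse / demand.mean()
print(bias, mape, mae, mae_pct, rmse, rmse_pct)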
Remove highly correlated predictors from the model. If you have two
or more factors with a high VIF, remove one from the model. Because
they supply redundant information, removing one of the correlated
factors usually doesn't drastically reduce the R-squared. Consider
using stepwise regression, best subsets regression, or specialized
knowledge of the data set to remove these variables. Select the model
that has the highest R-squared value.
Linear Regression Code Snippet.
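The original snippet is not reproduced here; as an illustration, a minimal scikit-learn version on assumed synthetic data might look like this:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

rng = np.random.RandomState(42)
X = rng.rand(200, 1) * 10                      # assumed single feature
y = 3.0 * X[:, 0] + 4.0 + rng.randn(200)       # assumed linear target with noise

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)

y_pred = model.predict(X_test)
print("coefficients:", model.coef_, "intercept:", model.intercept_)
print("R^2:", r2_score(y_test, y_pred), "MSE:", mean_squared_error(y_test, y_pred))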
…..
Logistic Regression
1 / (1 + e^-value)
The expression above is the sigmoid (logistic) function: an S-shaped curve that
maps any real-valued number to a value between 0 and 1; 'e' is the base of the
natural logarithm. We write the equation for logistic regression as follows:
p(x) = 1 / (1 + e^-(b0 + b1*x))
In the above equation, b0 and b1 are the two coefficients of the input x. We
estimate these two coefficients using maximum likelihood estimation. The
(negative) maximum log-likelihood cost function for logistic regression is:
J(b) = -(1/m) * Σ [ y * log(p(x)) + (1 - y) * log(1 - p(x)) ]
To get a better grasp of this cost function, let's look at the cost we calculate
for one single sample:
cost(p(x), y) = -y * log(p(x)) - (1 - y) * log(1 - p(x))
Looking at the preceding equation, we can see that the first term becomes zero
if y = 0, and the second term becomes zero if y = 1.
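A small numpy check of this single-sample cost (an illustrative sketch, not part of the original set):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_cost(p, y):
    # -y*log(p) - (1-y)*log(1-p): first term vanishes when y = 0, second when y = 1
    return -y * np.log(p) - (1 - y) * np.log(1 - p)

p = sigmoid(2.0)             # predicted probability for some assumed b0 + b1*x = 2.0
print(sample_cost(p, 1))     # small cost: confident and correct
print(sample_cost(p, 0))     # large cost: confident but wrong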
The main advantage of MLE is its asymptotic property: as the size of the data
increases, the estimate converges towards the true population parameter. We use
MLE in many statistical techniques to estimate parameters. The general steps we
follow to find an estimate for a parameter are:
Step 1: Make an assumption about the data-generating function (the distribution
of the data).
Step 2: Formulate the likelihood function for the data, using the data
generating function.
The likelihood function is nothing but the probability of observing this data
given the parameters (P(D|θ)). The parameters depend on our assumptions
and the data-generating function.
Step 3: Find the estimate that maximizes the likelihood function. This is why the
estimator calculated using MLE is called an M-estimator (a maximization-type
estimator).
Example: We have tossed a coin n times and observed k heads. Note that we
consider a head a success and a tail a failure. The maximum likelihood estimate
of the probability of heads is p̂ = k/n.
Example:
Suppose in a throw of a fair die, A is the event that either 1 or 2 will surface.
The odds of A are 50% (i.e. 1:2) in this case, which can be read as: the likelihood
of event A is 50% of the likelihood of the complementary event of A.
No. The reasons why linear regressions cannot be used in case of binary
classification are as follows:
Distribution of error terms: The distribution of data in case of linear and
logistic regression is different. Linear regression assumes that error terms are
normally distributed. In case of binary classification, this assumption does not
hold true.
Model output: In linear regression, the output is continuous. In case of binary
classification, an output of a continuous value does not make sense. For binary
classification problems, linear regression may predict values that fall outside
the range of 0 to 1. If we want the output in the form of probabilities, which can
be mapped to two different classes, then its range should be restricted to 0
and 1. As the logistic regression model can output probabilities with
logistic/sigmoid function, it is preferred over linear regression.
Variance of residual errors: Linear regression assumes that the variance of the
random errors is constant (homoscedasticity). This assumption is violated in the
case of binary outcomes, where the variance depends on the predicted probability.
2. The Logistic regression assumes that the independent variables are linearly
related to the log of odds.
4. Binary logistic regression assumes that the dependent variable is binary,
whereas ordered logistic regression requires the dependent variable to be
ordinal.
24. What is the decision boundary in case of Logistics
Regression?
For Logistic Regression, the decision boundary is linear. The boundary or line
that separates the two classes is chosen by setting a threshold probability.
Step 3: Classification happens by separating the points whose predicted
probability is below the threshold (e.g. 0.5) from those whose probability is
above it.
The inputs to a logistic regression model need to be numeric. The algorithm
cannot handle categorical variables directly. So, they need to be converted into
a format that is suitable for the algorithm to process.
The most common method of handling multiclass classification with logistic
regression is the one-vs-all approach. Under this approach, a number of models
equal to the number of classes is trained. The models work as follows.
For example, the first model classifies the datapoint depending on whether it
belongs to class 1 or some other class; the second model classifies the
datapoint into class 2 or some other class. This way, each data point can be
checked over all the classes.
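A hedged scikit-learn sketch of the one-vs-all approach, using the iris dataset as assumed example data:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)            # 3 classes -> 3 underlying models
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_train, y_train)
print("one model per class:", len(ovr.estimators_))
print("test accuracy:", ovr.score(X_test, y_test))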
However, there are problems with accuracy. It assumes equal costs for both
kinds of errors. A 99% accuracy can be excellent, good, mediocre, poor or
terrible depending upon the problem.
Precision: To get the value of precision we divide the total number of correctly
classified positive examples by the total number of predicted positive
examples. High Precision indicates an example labelled as positive is indeed
positive (a small number of FP).
PR Curve: It is the curve between precision and recall for various threshold
values. In the figure below we have 6 predictors showing their respective
precision-recall curve for various threshold values. The top right part of the
graph is the ideal space where we get high precision and recall. Based on our
application we can choose the predictor and the threshold value. PR AUC is
just the area under the curve. The higher its numerical value the better.
ROC Curve: ROC stands for Receiver Operating Characteristic; the curve plots TPR
against FPR for various threshold values. As TPR increases, FPR also increases.
As you can see in the first figure, we have four categories and we want the
threshold value that leads us closer to the top-left corner.
Comparing different predictors (here 3) on a given dataset also becomes easy
as you can see in figure 2, one can choose the threshold according to the
application at hand. ROC AUC is just the area under the curve, the higher its
numerical value the better.
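A minimal sketch of computing both curves and their AUCs with scikit-learn on an assumed imbalanced binary problem:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve, roc_curve, auc, roc_auc_score

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]          # probability of the positive class

precision, recall, pr_thresholds = precision_recall_curve(y_test, scores)
fpr, tpr, roc_thresholds = roc_curve(y_test, scores)

print("PR  AUC:", auc(recall, precision))
print("ROC AUC:", roc_auc_score(y_test, scores))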
PR vs ROC Curve:
Both the metrics are widely used to judge a model’s performance.
Because true negatives (TN) do not appear in the precision or recall equations,
these metrics are useful for imbalanced classes. In the case of class imbalance,
where the negative class is in the majority, the metrics do not give much weight
to the large number of true negatives, which makes them more resistant to the
imbalance. This matters when detection of the positive class is what is important.
For example, detecting cancer patients, where there is a high class imbalance
because very few of those diagnosed actually have it. We certainly do not want a
person with cancer to go undetected (recall), and we want to be sure that a
detected case really has it (precision).
Because the ROC curve does consider true negatives (the negative class), it is
useful when both classes are important to us, like the detection of cats and dogs.
The inclusion of true negatives ensures that both classes are given importance,
for example the output of a CNN model determining whether an image is of a cat or
a dog.
…..
30. What is Decision Tree Algorithm?
31. What are the different algorithms used in Decision
Tree? Discuss.
Pruning is a technique used in Decision Tree that reduces the size of decision
trees by removing sections of the tree that provide little power to classify
instances. Pruning reduces the complexity of the final classifier, and hence
improves predictive accuracy by the reduction of overfitting.
There are generally two methods for pruning trees: pre-pruning and post-
pruning.
Post-pruning will involve fully growing the tree in its entirety, and then
trimming the nodes of the tree in a bottom-up fashion.
Let's assume we have a sample dataset of 1000 instances (x) and we are using
the CART algorithm. Bagging of the CART algorithm would work as follows: create
many bootstrap samples (random samples drawn with replacement) from the 1000
instances, train a CART model on each sample, and combine the models' predictions
by averaging (regression) or majority vote (classification).
Advantages
1. Easy to understand: Decision tree output is very easy to understand, even
for people from a non-analytical background. It does not require any
statistical knowledge to read and interpret. Its graphical representation
is very intuitive and users can easily relate it to their hypotheses.
2. Useful in data exploration: A decision tree is one of the fastest ways to
identify the most significant variables and the relationships between two
or more variables. With the help of decision trees, we can create new
variables/features that have better power to predict the target variable.
You can refer to the article "Trick to enhance power of regression model"
for one such trick. It can also be used in the data exploration stage; for
example, when we are working on a problem with information available in
hundreds of variables, a decision tree will help to identify the most
significant ones.
3. Less data cleaning required: It requires less data cleaning compared to
some other modelling techniques. It is fairly robust to outliers and
missing values.
4. Data type is not a constraint: It can handle both numerical and
categorical variables.
5. Non-Parametric Method: Decision tree is considered to be a non-
parametric method. This means that decision trees have no assumptions
about the space distribution and the classifier structure.
Disadvantages
1. Overfitting: Overfitting is one of the most common practical difficulties
for decision tree models. This problem is addressed by setting constraints
on model parameters and by pruning (discussed earlier).
2. Not fit for continuous variables: While working with continuous numerical
variables, a decision tree loses information when it categorizes variables
into different bins.
3. Decision trees do not work well if you have smooth boundaries, i.e. they
work best when the target can be approximated by a piecewise-constant
model. If you truly have a linear target function, decision trees are not
the best choice.
…..
38. What is Random Forest? Why is it preferred over
Decision Trees?
Advantage
Disadvantage
Constructing a Random Forest is much harder and more time-consuming than
building decision trees.
More computational resources are required to implement the Random Forest
algorithm.
It is less intuitive when we have a large collection of decision trees.
The prediction process using random forests is very time-consuming in
comparison with other algorithms.
The technique of cross validation (CV) is best explained by example using the
most common method, K-Fold CV. When we approach a machine learning
problem, we make sure to split our data into a training and a testing set.
In K-Fold CV, we further split our training set into K number of subsets, called
folds. We then iteratively fit the model K times, each time training the data on
K-1 of the folds and evaluating on the Kth fold (called the validation data).
In Layman Term: When you train each tree in random forest, you will not use
all the samples. So for each bag, those unused samples can be used to find the
prediction error for that particular bag. The OOB error rate can then be
obtained by averaging the prediction error from all the bags.
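A hedged sketch of both ideas (K-Fold CV and the OOB score of a random forest) with scikit-learn on assumed data:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)   # assumed data

# K-Fold CV: 5 folds, each fold used once as the validation data
model = RandomForestClassifier(n_estimators=100, random_state=0)
cv_scores = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
print("mean CV accuracy:", cv_scores.mean())

# OOB error: each tree is evaluated on the samples left out of its bootstrap bag
oob_model = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0).fit(X, y)
print("OOB accuracy:", oob_model.oob_score_, "OOB error:", 1 - oob_model.oob_score_)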
43. What are the different hyperparameters involved in
tuning a Random Forest?
44. Random Forest Code Snippet:
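The original snippet is not reproduced here; as an illustration, a minimal scikit-learn random forest on assumed data, showing hyperparameters commonly tuned (n_estimators, max_depth, max_features, min_samples_split, min_samples_leaf), might look like this:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

rf = RandomForestClassifier(
    n_estimators=200,       # number of trees
    max_depth=None,         # grow trees fully unless limited
    max_features="sqrt",    # features considered at each split
    min_samples_split=2,    # minimum samples required to split a node
    min_samples_leaf=1,     # minimum samples required at a leaf
    n_jobs=-1,
    random_state=0,
).fit(X_train, y_train)

print("test accuracy:", rf.score(X_test, y_test))
print("top feature importances:", rf.feature_importances_[:5])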
…..
Of the k closest data points (or 'neighbors'), the category that appears most
often among those k neighbors is the category assigned to the new data point.
Intuitively, this makes sense - a data point belongs in the category it’s most
similar to with respect to its features/properties. The most similar data points
are the ones that are nearest to that data point, if you visualize it on a graph.
1. For a new data point, kNN will calculate its distance from every single
data point in our dataset. The most popular distance metric used is the
Euclidean distance.
2. Once every distance is calculated, the algorithm will pick the K nearest
data points.
3. For classification problems, it will make a prediction based on the classes
of those k nearest data points. In this context, it is a good idea to
choose K as an odd number to avoid ties. The predicted class will be the
most frequently occurring class within the K data points. For regression,
it can predict based on the mean or median of those data points.
Short Answer:
If K is small, then results might not be reliable because noise will have a
higher influence on the result. If K is large, then there will be a lot of
processing which may adversely impact the performance of the algorithm.
So, the following must be considered while choosing the value of K:
K should be odd so that there are no ties in the voting. If the square root of the
number of data points is even, then add or subtract 1 to make it odd.
Long Answer:
The value of optimum K totally depends on the dataset that you are using.
The best value of K for KNN is highly data-dependent. In different
scenarios, the optimum K may vary; it is more or less a trial-and-error process.
A small value of K means that noise will have a higher influence on the
result.
A larger value of K smooths out the noise and can improve accuracy, but if K is
too large, you are under-fitting your model. In this case, the error will go up
again. So, at the
same time you also need to prevent your model from under-fitting. Your
model should retain generalization capabilities otherwise there are fair
chances that your model may perform well in the training data but
drastically fail in the real data. Larger K will also increase the
computational expense of the algorithm.
1. Square Root Method: Take square root of the number of samples in the
training dataset.
2. Cross Validation Method: We should also use cross validation to find out
the optimal value of K in KNN. Start with K=1, run cross validation (5 to 10
fold), measure the accuracy and keep repeating till the results become
consistent.
K = 1, 2, 3... As K increases, the error usually goes down, then stabilizes, and
then rises again. Pick the optimum K at the beginning of the stable zone.
This is also called the Elbow Method (see the sketch after this list).
3. Domain Knowledge also plays a vital role while choosing the optimum
value of K.
I would suggest trying a mix of all the above approaches to reach a
conclusion.
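A minimal sketch of the cross-validation/elbow approach to choosing K, on assumed synthetic data:

from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

errors = {}
for k in range(1, 31, 2):                       # odd K values to avoid ties
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=10).mean()
    errors[k] = 1 - acc

best_k = min(errors, key=errors.get)
print("CV error by K:", {k: round(e, 3) for k, e in errors.items()})
print("K with lowest CV error:", best_k)
# In practice, plot the errors against K and pick K at the start of the stable ("elbow") zone.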
When it gets the training data, it does not learn and make a model, it just
stores the data. It does not derive any discriminative function from the
training data. It uses the training data when it actually needs to do some
prediction. So, KNN does not immediately learn a model, but delays the
learning, that is why it is called lazy learner.
KNN works well with smaller dataset because it is a lazy learner. It needs
to store all the data and then makes decision only at run time. It needs to
calculate the distance of a given point with all other points. So if dataset is
large, there will be a lot of processing which may adversely impact the
performance of the algorithm.
KNN is also very sensitive to noise in the dataset. If the dataset is large,
there is a greater chance of noise in the dataset, which adversely affects the
performance of the KNN algorithm. For each new data point, the kNN classifier
must:
1. Calculate the distances to all points in the training set and store
them
2. Sort the calculated distances
3. Store the K nearest points
4. Calculate the proportions of each class
5. Assign the class with the highest proportion
Obviously, this is a very taxing process, both in terms of time and space
complexity. For a single query point, computing the distances to all n training
points takes time linear in n (and building a full pairwise distance matrix is
quadratic), and sorting them is an O(n log n) operation; repeating this for every
new point makes the whole process very expensive indeed.
Another problem is memory, since all pairwise distances must be stored
and sorted in memory on a machine. With very large datasets, local
machines will usually crash.
Advantages of KNN
3. KNN is very easy to implement. There are only two parameters required
to implement KNN i.e. the value of K and the distance function (e.g.
Euclidean or Manhattan etc.)
Disadvantages of KNN
1. Does not work well with large datasets: In large datasets, the cost of
calculating the distance between the new point and each existing point is
huge, which degrades the performance of the algorithm.
2. Does not work well with high dimensions: The KNN algorithm doesn't work
well with high-dimensional data because, with a large number of dimensions,
distances become less meaningful and it becomes harder for the algorithm to
find genuinely close neighbours.
3. Need feature scaling: We need to do feature scaling (standardization
and normalization) before applying KNN algorithm to any dataset. If we
don't do so, KNN may generate wrong predictions.
The Chebyshev distance between two vectors or points p and q, with standard
coordinates p_i and q_i respectively, is:
D(p, q) = max_i |p_i - q_i|
In chess, for example, the minimum number of moves a king needs to go from one
square of the chessboard to another equals the Chebyshev distance between the
centres of the squares.
Create dummy variables out of a categorical variable and include them instead
of original categorical variable. Unlike regression, create k dummies instead of
(k-1).
For example, a categorical variable named "Department" has 5 unique levels /
categories, so we will create 5 dummy variables. Each dummy variable is 1 for its
own department and 0 otherwise.
Yes, K-nearest neighbour can be used for regression. In other words, the K-nearest
neighbour algorithm can be applied when the dependent variable is continuous. In
this case, the predicted value is the average of the values of its k nearest
neighbours.
KNN and k-means clustering both are very different algorithms that solve
different problems and have their own meanings of what the variable ‘k’
is. KNN is a supervised classification algorithm that will label new data points
based on the ‘k’ number of nearest data points and k-means clustering is an
unsupervised clustering algorithm that groups the data into ‘k’ number of
clusters.
Sampling a smaller dataset does several things from the perspective of a single
data point, since kNN works on a point-by-point basis.
1. The average distance to the k nearest neighbours increases due to
increased sparsity in the dataset.
2. Consequently, the area covered by k-nearest neighbours increases in
size and covers a larger area of the feature space.
3. The sample variance increases.
A consequence to this change in input is an increase in variance. When we talk
of variance, we refer to the variability in the predictions given different
samples from the population. Why would the immediate effects of sampling
lead to increased variance of the model?
Notice that now a larger area of the feature space is represented by the same k
data points. While our sample size has not grown, the population space that it
represents has increased in size. This will result in higher variance in the
proportion of classes in the k nearest data points, and consequently a higher
variance in the classification of each data point.
Long Answer: What really is the significance of these effects? First, it gives
hints that a lower k value makes the kNN model more “sensitive.” That is, it is
more sensitive to the local changes in the dataset. The “sensitivity” of the
model directly translates to its variance.
Final Verdict: In order to offset the increased variance due to sampling, k can
be increased to decrease model variance.
3. For getting the predicted class, iterate from 1 to total number of training
data points
1. Calculate the distance between test data and each row of training
data. Here we will use Euclidean distance as our distance metric since
it’s the most popular method. The other metrics that can be used are
Chebyshev, cosine, etc.
2. Sort the calculated distances in ascending order based on distance
values
3. Get top k rows from the sorted array
4. Get the most frequent class of these rows
5. Return the predicted class
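A from-scratch sketch of these steps using only numpy (Euclidean distance, majority vote) on assumed toy data:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
    # 1. distance between the test point and each row of training data
    distances = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
    # 2-3. sort by distance and take the top k rows
    nearest = np.argsort(distances)[:k]
    # 4-5. return the most frequent class among those k rows
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [8.2, 7.9]])  # assumed toy data
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))       # -> 0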
59. What is SVM Algorithm?
SVM stands for support vector machine, it is a supervised machine learning
algorithm which can be used for both Regression and Classification.
In this algorithm, we plot each data item as a point in n-dimensional space
(where n is the number of features you have), with the value of each feature
being the value of a particular coordinate.
For example, if we only had two features like the height and hair length of an
individual, we'd first plot these two variables in two-dimensional space, where
each point has two coordinates. (The points that end up closest to the separating
boundary are known as support vectors.)
Now, we find a line that splits the data between the two differently classified
groups. This will be the line for which the distance to the closest point of each
of the two groups is largest.
In the example shown above, the line which splits the data into two differently
classified groups is the black line, since the two closest points are the farthest
apart from the line. This line is our classifier. Then, depending on where the
testing data lands on either side of the line, that’s what class we can classify
the new data as.
A support vector machine attempts to find the line that "best" separates two
classes of points. By "best", we mean the line that results in the largest
margin between the two classes. The points that lie on this margin are
the support vectors.
The vectors that define the hyperplane are the support vectors.
SVM algorithms use a set of mathematical functions that are defined as the
kernel. The function of kernel is to take data as input and transform it into the
required form. Different SVM algorithms use different types of kernel
functions. These functions can be of different types, for example:
1. Linear Kernel
2. Polynomial kernel
3. Radial basis kernel
4. Sigmoid kernel
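A hedged sketch of trying these kernels with scikit-learn's SVC on assumed synthetic data:

from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# "rbf" is the radial basis kernel listed above
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    score = cross_val_score(SVC(kernel=kernel), X, y, cv=5).mean()
    print(f"{kernel:8s} mean CV accuracy = {score:.3f}")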
Short Answer: Because it places the decision boundary such that it maximizes
the distance between two clusters.
Long Answer: Choosing the best hyperplane means choosing the one for which the
distance from the nearest training points is maximum. This is formalized by the
geometric margin. Without getting into the details of the derivation, the
geometric margin is given by:
γ = y (w · x + b) / ||w||
which is simply the functional margin y (w · x + b) normalized by ||w||. These
intuitions lead to the maximum margin classifier, which is a precursor to the SVM.
1. SVM tries to find the “best” margin (distance between the line and the
support vectors) that separates the classes and this reduces the risk of
error on the data, while logistic regression does not, instead it can have
different decision boundaries with different weights that are near the
optimal point.
2. SVM works well with unstructured and semi-structured data like text
and images while logistic regression works with already identified
independent variables.
Depending on the number of training sets (data)/features that you have, you
can choose to use either logistic regression or support vector machine.
n = number of features,
m = number of training examples
Short Answer:
Cost and Gamma are the hyper-parameters that decide the performance of an
SVM model. There should be a fine balance between Variance and Bias for any
ML model. (this is a science and an art - as we call it in empirical studies)
For SVM, a high value of gamma makes the model fit the training data very closely
(low bias but high variance), and vice versa. Similarly, a large value of the cost
parameter (C) penalizes misclassifications heavily, which also gives low bias but
high variance, and vice versa.
Following table summarizes the above explanation -
The art is to choose a model with optimum variance and bias. Therefore, you
need to choose the values of C and Gamma accordingly.
Optimum values of C and Gamma can be found by using methods like
Gridsearch.
Long Answer:
The C parameter tells the SVM optimization how much you want to avoid
misclassifying each training example. For large values of C, the optimization
will choose a smaller-margin hyperplane if that hyperplane does a better job of
getting all the training points classified correctly. Conversely, a very small value
of C will cause the optimizer to look for a larger margin separating hyperplane,
even if that hyperplane misclassifies more points. For very tiny values of C, you
should get misclassified examples, often even if your training data is linearly
separable.
The gamma parameter defines how far the influence of a single training
example reaches, with low values meaning ‘far’ and high values meaning
‘close’.
The gamma parameters can be seen as the inverse of the radius of influence of
samples selected by the model as support vectors. If gamma is too large, the
radius of the area of influence of the support vectors only includes the support
vector itself and no amount of regularization with C will be able to prevent
overfitting.
When gamma is very small, the model is too constrained and cannot capture
the complexity or “shape” of the data. The region of influence of any selected
support vector would include the whole training set. The resulting model will
behave similarly to a linear model with a set of hyperplanes that separate the
centres of high density of any pair of two classes.
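A minimal sketch of tuning C and gamma with a grid search, as suggested above, on assumed synthetic data:

from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

param_grid = {
    "C":     [0.1, 1, 10, 100],
    "gamma": [0.001, 0.01, 0.1, 1],
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)
print("best parameters:", search.best_params_)
print("best CV accuracy:", search.best_score_)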
SVM Advantages
SVMs are very good when we have no prior knowledge of the data.
They work well even with unstructured and semi-structured data like text,
images and trees.
The kernel trick is the real strength of SVM. With an appropriate kernel
function, we can solve very complex problems.
Unlike neural networks, SVM training does not get stuck in local optima
(the optimization problem is convex).
It scales relatively well to high-dimensional data.
SVM models generalize well in practice; the risk of over-fitting is lower
in SVM.
SVM is often compared with ANN; on many problems, SVMs give better results.
SVM Disadvantages
Choosing a “good” kernel function is not easy.
Long training time for large datasets.
It is difficult to understand and interpret the final model, the variable
weights, and their individual impact.
Since the final model is not easy to inspect, we cannot make small
calibrations to it; hence it is tough to incorporate our business logic.
The SVM hyperparameters are the cost (C) and gamma. It is not easy to
fine-tune these hyperparameters, and it is hard to visualize their impact.
…..
Naïve Bayes
Bayes' theorem lets us calculate the probability of X given a test result, or
P(X|Test). For example: what is the probability that an e-mail is spam, given
that my spam filter classified it as spam?
In Layman's Terms: The simple meaning of naive is being willing to believe that
life is simple and fair, which is not true. Naive Bayes is naive because it
assumes that the features going into the model are not related to each other in
any way; a change in one variable will not directly affect the others.
Long Answer: Naive Bayes (NB) is ‘naive’ because it makes the assumption that
features of a measurement are independent of each other. This is naive
because it is (almost) never true. Here is how it works even then: NB is a very
intuitive classification algorithm. It asks the question, “Given these features,
does this measurement belong to class A or B?”, and answers it by taking the
proportion of all previous measurements with the same features belonging to
class A multiplied by the proportion of all measurements in class A. If this
number is bigger than the corresponding calculation for class B then we say
the measurement belongs in class A.
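A hedged scikit-learn sketch of a Gaussian Naive Bayes classifier, using the iris dataset as assumed example data:

from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nb = GaussianNB().fit(X_train, y_train)
print("test accuracy:", nb.score(X_test, y_test))
print("class priors P(class):", nb.class_prior_)   # the "proportion in class A" part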
The feature matrix contains all the vectors (rows) of the dataset, where each
vector consists of the values of the independent features.
The response vector contains the value of the class variable (prediction or
output) for each row of the feature matrix.
ADVANTAGES
1. Fast
2. Highly scalable.
3. Used for binary and Multi class Classification.
4. Great Choice for text classification.
5. Can easily train smaller data sets.
DISADVANTAGES
Naive Bayes assumes that the features are independent of each other.
However, in the real world, features often depend on each other.
The ‘means’ in the K-means refers to averaging of the data; that is, finding the
centroid.
The Elbow method looks at the total WSS (within-cluster sum of squares) as a
function of the number of clusters: one should choose a number of clusters such
that adding another cluster does not substantially improve the total WSS.
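A minimal sketch of the Elbow method using the k-means inertia (total WSS) from scikit-learn, on assumed blob data:

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)   # assumed data

wss = {}
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wss[k] = km.inertia_                 # total within-cluster sum of squares
print(wss)
# Plot WSS against k and pick the k where the curve bends (the "elbow"),
# i.e. where adding another cluster no longer reduces WSS much.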
A common approach is to first use hierarchical clustering to determine the number
of clusters and then use k-means clustering to make more stable clusters, as
hierarchical clustering is a single-pass exercise whereas k-means is an iterative
process.
K-Means Advantages :
K-Means Disadvantages :
Agglomerative hierarchical algorithms − In agglomerative hierarchical algorithms,
each data point starts as its own cluster, and we then successively merge or
agglomerate (bottom-up approach) the pairs of clusters. The hierarchy of the
clusters is represented as a dendrogram or tree structure.
Divisive hierarchical algorithms − On the other hand, in divisive hierarchical
algorithms, all the data points are treated as one big cluster and the process of
clustering involves dividing (Top-down approach) the one big cluster into
various small clusters.
Parts of a Dendrogram:
…..
Gradient Boosting Algorithms
a. Adaboost
b. GBM
c. XGBoost
How it evolved: The first boosting algorithm to gain popularity was AdaBoost,
or Adaptive Boosting. It later evolved and was generalised into Gradient Boosting.
AdaBoost combines multiple weak learners into a single strong learner. The
weak learners in AdaBoost are decision trees with a single split, called decision
stumps. When AdaBoost creates its first decision stump, all observations are
weighted equally. To correct the previous errors, the observations that were
incorrectly classified are then given more weight than the observations that were
correctly classified. AdaBoost algorithms can be used for both classification
and regression problems.
GBM uses gradient descent to find the shortcomings in the previous learner's
predictions. The GBM algorithm can be summarized by the following steps:
Fit a model to the data, F1(x) = y
Fit a model to the residuals, h1(x) = y − F1(x)
Create a new model, F2(x) = F1(x) + h1(x)
By combining weak learner after weak learner, our final model is able to
account for a lot of the error from the original model and reduces this error
over time.
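A from-scratch sketch of this idea (each new tree fit to the residuals of the current model) on assumed synthetic data with squared loss:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.rand(300, 1) * 6
y = np.sin(X[:, 0]) + rng.randn(300) * 0.1   # assumed noisy target

learning_rate = 0.1
prediction = np.zeros_like(y)                # F0(x): start from zero
for m in range(100):
    residual = y - prediction                # shortcomings of the current model
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)     # h_m(x)
    prediction += learning_rate * tree.predict(X)                  # F_{m+1} = F_m + lr * h_m

print("final training MSE:", np.mean((y - prediction) ** 2))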
Below are some constraints that can be imposed on the construction of
decision trees:
Number of trees: generally, adding more trees to the model is very slow to
overfit, so the advice is to keep adding trees until no further improvement is
observed.
Tree depth, deeper trees are more complex trees and shorter trees are
preferred. Generally, better results are seen with 4-8 levels.
Number of nodes or number of leaves, like depth, this can constrain the size of
the tree, but is not constrained to a symmetrical structure if other constraints
are used.
Number of observations per split imposes a minimum constraint on the
amount of training data at a training node before a split can be considered
Minimum improvement to loss is a constraint on the improvement of any split
added to a tree.
A big insight into bagging ensembles and random forests was allowing trees to
be greedily created from subsamples of the training dataset. This same benefit
can be used to reduce the correlation between the trees in the sequence in
gradient boosting models. This variation of boosting is called stochastic
gradient boosting.
Classical decision trees like CART are not used as weak learners; instead, a
modified form called a regression tree is used that has numeric values in the
leaf nodes (also called terminal nodes). The values in the leaves of the trees
can be called weights in some literature. As such, the leaf weight values of the
trees can be regularized using popular regularization functions, such as:
L1 regularization of weights.
L2 regularization of weights.
The additional regularization term helps to smooth the final learnt weights to
avoid over-fitting. Intuitively, the regularized objective will tend to select a
model employing simple and predictive functions.
…..
Dimensionality Reduction Algorithms/Techniques
High dimensionality means that the dataset has a large number of features. The
primary problem associated with high dimensionality in the machine learning
field is model overfitting, which reduces the ability to generalize beyond the
examples in the training set.
For the sake of intuition, let us consider variance as the spread of data -
distance between the two farthest points.
Assumption:
Typically, it is believed that if the variance of the data is large, it offers
more information than data which has small variance (this may or may not be
true). This is the assumption that PCA intends to exploit.
That was the intuitive explanation of what PCA does. Just for further
clarification:
Eigenvalues = variance of the data along a particular axis in the new coordinate
system. In above example, Eigenvalue(X(new)) = 4.24.
Eigenvectors = the vectors which represent the new coordinate system. In
above example, vector [1,1], would be an eigenvector for X(new), and [1,-1]
eigenvector for Y(new). Since they are just directions - solvers typically give us
unit vectors.
Diagonalization: This is the part where you solve (Cov − λI)v = 0, i.e. find the
eigenvectors of the covariance matrix. In the coordinate system defined by these
eigenvectors, the covariance matrix becomes diagonal: only the variance terms
remain and the covariance terms are zero.
Steps of PCA:
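The detailed steps are not reproduced here; as an illustration, a minimal numpy sketch of the usual procedure (standardize, covariance matrix, eigen decomposition, sort, project) on assumed data:

import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(200, 3) @ np.array([[2.0, 0.3, 0.0],
                                  [0.3, 1.0, 0.0],
                                  [0.0, 0.0, 0.2]])   # assumed correlated features

X_std = (X - X.mean(axis=0)) / X.std(axis=0)          # 1. standardize
cov = np.cov(X_std, rowvar=False)                     # 2. covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)                # 3. eigenvalues / eigenvectors
order = np.argsort(eigvals)[::-1]                     # 4. sort by explained variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
X_pca = X_std @ eigvecs[:, :2]                        # 5. project onto top 2 components

print("explained variance ratio:", eigvals[:2] / eigvals.sum())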
Basically, LDA helps you find the 'boundaries' around clusters of classes. It
projects your data points on a line so that your clusters 'are as separated as
possible', with each cluster having a relative (close) distance to a centroid.
What was that stuff about dimensionality?
Let's say you have a group of data points in 2 dimensions, and you want to
separate them into 2 groups. LDA reduces the dimensionality of your set to at
most K − 1 dimensions, where K is the number of groups: here K = 2, so
2 − 1 = 1 dimension.
This can be summarized by the matrix multiplication Y = X × W, where X is an
n × d matrix representing the n samples, W is a d × k transformation matrix, and
Y is the transformed n × k matrix of samples in the new subspace.
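A hedged sketch of this Y = X·W projection via scikit-learn's LinearDiscriminantAnalysis, using the iris dataset as assumed example data:

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)                  # n x d with d = 4, K = 3 classes

lda = LinearDiscriminantAnalysis(n_components=2)   # at most K - 1 = 2 components
X_new = lda.fit_transform(X, y)                    # the transformed n x k samples

print("original shape:", X.shape, "-> projected shape:", X_new.shape)
print("explained variance ratio:", lda.explained_variance_ratio_)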
Disadvantages:
…..