KNN Interview Question Rev 2.0
1) [True or False] The k-NN algorithm does more computation at test time than at train time.
A) TRUE
B) FALSE
Solution: A
The training phase of the algorithm consists only of storing the feature vectors and class labels of the training samples.
In the testing phase, a test point is classified by assigning the label that is most frequent among the k training samples
nearest to that query point – hence the higher computation at test time.
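As a rough illustration (not part of the original question), the sketch below times scikit-learn's KNeighborsClassifier on synthetic data; the array sizes and k value are arbitrary choices, but fitting (which mostly just stores the data) should come out much cheaper than predicting.

# Rough timing sketch: k-NN "training" vs. prediction cost (synthetic data, arbitrary sizes)
import time
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(50000, 20))      # synthetic training features
y_train = rng.integers(0, 2, size=50000)    # synthetic binary labels
X_test = rng.normal(size=(1000, 20))

knn = KNeighborsClassifier(n_neighbors=5)

t0 = time.perf_counter()
knn.fit(X_train, y_train)                   # mostly just stores the data (plus an index build)
t1 = time.perf_counter()
knn.predict(X_test)                         # neighbor search / distance work happens here
t2 = time.perf_counter()

print(f"fit: {t1 - t0:.3f}s  predict: {t2 - t1:.3f}s")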
2) In the image below, which would be the best value of k, assuming that the algorithm you are using is k-Nearest Neighbor?
A) 3
B) 10
C) 20
D) 50
Solution: B
Validation error is the least when the value of k is 10, so it is best to use this value of k.
6) Which of the following machine learning algorithms can be used for imputing missing values of both categorical and
continuous variables?
A) K-NN
B) Linear Regression
C) Logistic Regression
Solution: A
The k-NN algorithm can be used for imputing missing values of both categorical and continuous variables.
8) Which of the following distance measures do we use in case of categorical variables in k-NN?
1. Hamming Distance
2. Euclidean Distance
3. Manhattan Distance
A) 1
B) 2
C) 3
D) 1 and 2
E) 2 and 3
F) 1,2 and 3
Solution: A
Both Euclidean and Manhattan distances are used in the case of continuous variables, whereas Hamming distance is used
in the case of categorical variables.
9) Which of the following will be the Euclidean distance between the two data points A(1,3) and B(2,3)?
A) 1
B) 2
C) 4
D) 8
Solution: A
sqrt( (1-2)^2 + (3-3)^2) = sqrt(1^2 + 0^2) = 1
10) Which of the following will be the Manhattan distance between the two data points A(1,3) and B(2,3)?
A) 1
B) 2
C) 4
D) 8
Solution: A
|1-2| + |3-3| = 1 + 0 = 1 (Manhattan distance is the sum of absolute differences; there is no square root)
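As a quick sanity check of questions 9 and 10, here is a tiny plain-Python sketch computing both distances for A(1,3) and B(2,3):

# Check the Euclidean and Manhattan distances between A(1, 3) and B(2, 3)
import math

A, B = (1, 3), (2, 3)

euclidean = math.sqrt((A[0] - B[0]) ** 2 + (A[1] - B[1]) ** 2)   # straight-line distance
manhattan = abs(A[0] - B[0]) + abs(A[1] - B[1])                  # sum of absolute differences

print(euclidean)   # 1.0
print(manhattan)   # 1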
Context: 11-12
Suppose you have been given the following data, where x and y are the two input variables and Class is the dependent variable.
11) Suppose you want to predict the class of the new data point x=1 and y=1 using 3-NN. Which class will this point belong to?
A) + Class
B) – Class
C) Can’t say
D) None of these
Solution: A
All three nearest points are of the + class, so this point will be classified as the + class.
12) In the previous question, suppose you now want to use 7-NN instead of 3-NN. Which class will the point x=1, y=1 belong to?
A) + Class
B) – Class
C) Can’t say
Solution: B
Now this point will be classified as the – class because there are 4 – class points and 3 + class points among its 7 nearest neighbors.
Context 13-14:
Suppose you have been given the following 2-class data, where "+" represents the positive class and "–" represents the negative class.
13) Which of the following values of k in k-NN would minimize the leave-one-out cross-validation error?
A) 3
B) 5
C) Both have same
D) None of these
Solution: B
5-NN will have least leave one out cross validation error.
14) Which of the following would be the leave-one-out cross-validation accuracy for k=5?
A) 2/14
B) 4/14
C) 6/14
D) 8/14
E) None of the above
Solution: E
With 5-NN, the leave-one-out cross-validation accuracy is 10/14.
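The data points from the figure are not reproduced here, but as a sketch of how such a leave-one-out accuracy would be computed with scikit-learn, here is a toy 14-point example (the coordinates and labels below are made up):

# Leave-one-out cross-validation accuracy for k-NN on a made-up 14-point dataset
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1, 2], [2, 3], [3, 1], [1, 1], [2, 2], [3, 2], [2, 1],
              [6, 5], [7, 7], [8, 6], [7, 5], [6, 6], [8, 8], [7, 6]])
y = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1])

for k in (3, 5):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=LeaveOneOut())
    print(k, scores.mean())   # fraction of the 14 points classified correctly when held out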
15) Which of the following will be true about k in k-NN in terms of Bias?
A) When you increase k, the bias will increase
B) When you decrease k, the bias will increase
C) Can’t say
D) None of these
Solution: A
A large K means a simpler model, and a simpler model is generally considered to have high bias.
16) Which of the following will be true about k in k-NN in terms of variance?
A) When you increase k, the variance will increase
B) When you decrease k, the variance will increase
C) Can’t say
D) None of these
Solution: B
A simpler model (larger k) is considered a lower-variance model.
17) The following two distances (Euclidean distance and Manhattan distance), which we generally use in the k-NN algorithm, are
given to you. These distances are between two points A(x1, y1) and B(x2, y2).
Your task is to tag both distances by looking at the following two graphs. Which of the following options is true about the graphs
below?
18) When you find noise in data which of the following option would you consider in k-NN?
A) I will increase the value of k
B) I will decrease the value of k
C) Noise cannot be dependent on the value of k
D) None of these
Solution: A
To be more sure of which classifications you make, you can try increasing the value of k.
19) In k-NN it is very likely to overfit due to the curse of dimensionality. Which of the following options would you consider to
handle such a problem?
1. Dimensionality Reduction
2. Feature selection
A) 1
B) 2
C) 1 and 2
D) None of these
Solution: C
In such a case you can use either a dimensionality reduction algorithm or a feature selection algorithm.
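As one possible illustration (not prescribed by the original answer), scikit-learn makes it easy to chain dimensionality reduction with k-NN; the dataset, number of components and k below are illustrative choices:

# Sketch: reduce dimensionality (here with PCA) before k-NN to ease the curse of dimensionality
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)          # 64-dimensional example data

pipe = make_pipeline(StandardScaler(),
                     PCA(n_components=10),   # keep 10 components (illustrative)
                     KNeighborsClassifier(n_neighbors=5))

print(cross_val_score(pipe, X, y, cv=5).mean())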
20) Below are two statements. Which of the following is true about these statements?
1. k-NN is a memory-based approach, in that the classifier immediately adapts as we collect new training data.
2. The computational complexity for classifying new samples grows linearly with the number of samples in the training
dataset in the worst-case scenario.
A) 1
B) 2
C) 1 and 2
D) None of these
Solution: C
Both statements are true and self-explanatory.
21) Suppose you have been given the following images (1 left, 2 middle and 3 right). Now your task is to find out the value of k in
k-NN in each image, where k1 is for the 1st, k2 is for the 2nd and k3 is for the 3rd figure.
A) k1 > k2> k3
B) k1<k2
C) k1 = k2 = k3
D) None of these
Solution: D
The value of k is highest in k3, whereas in k1 it is lowest.
22) Which of the following values of k in the following graph would give the least leave-one-out cross-validation accuracy?
A) 1
B) 2
C) 3
D) 5
Solution: B
If you keep the value of k as 2, it gives the lowest cross validation accuracy. You can try this out yourself.
23) A company has built a kNN classifier that gets 100% accuracy on training data. When they deployed this model on the client
side, it was found that the model is not accurate at all. Which of the following might have gone wrong?
Note: The model has been successfully deployed and no technical issues were found at the client side except the model performance
24) You have been given the following 2 statements. Find which of these options is/are true in the case of k-NN.
1. In case of a very large value of k, we may include points from other classes in the neighborhood.
2. In case of a too small value of k, the algorithm is very sensitive to noise.
A) 1
B) 2
C) 1 and 2
D) None of these
Solution: C
Both the options are true and are self explanatory.
26) True-False: It is possible to construct a 2-NN classifier by using the 1-NN classifier?
A) TRUE
B) FALSE
Solution: A
You can implement a 2-NN classifier by ensembling 1-NN classifiers
27) In k-NN what will happen when you increase/decrease the value of k?
A) The boundary becomes smoother with increasing value of K
B) The boundary becomes smoother with decreasing value of K
C) Smoothness of the boundary doesn't depend on the value of K
D) None of these
Solution: A
The decision boundary would become smoother by increasing the value of K
28) Following are two statements given for the k-NN algorithm. Which of the statement(s)
is/are true?
1. We can choose optimal value of k with the help of cross validation
2. Euclidean distance treats each feature as equally important
A) 1
B) 2
C) 1 and 2
D) None of these
Solution: C
Both the statements are true
Context 29-30:
Suppose you have trained a k-NN model and now you want to get predictions on test data. Before getting the predictions,
suppose you want to calculate the time taken by k-NN for predicting the class of the test data.
Note: Calculating the distance between 2 observations will take D time.
29) What would be the time taken by 1-NN if there are N (very large) observations in the test data?
A) N*D
B) N*D*2
C) (N*D)/2
D) None of these
Solution: A
Each of the N test observations requires its own distance computation taking D time, so the total time is N*D and option A is correct.
30) What would be the relation between the times taken by 1-NN, 2-NN and 3-NN?
A) 1-NN >2-NN >3-NN
B) 1-NN < 2-NN < 3-NN
C) 1-NN ~ 2-NN ~ 3-NN
D) None of these
Solution: C
The prediction time in the kNN algorithm is essentially the same for any value of k, since the dominant cost is computing the distances, which does not depend on k.
K = Number of nearest neighbors you want to select to predict the class of a given item.
If K is small, then results might not be reliable because noise will have a higher influence on the result. If K is large, then there
will be a lot of processing, which may adversely impact the performance of the algorithm. So, the following must be considered
while choosing the value of K:
K should be odd so that there are no ties in the voting. If the square root of the number of data points is even, then add or
subtract 1 to make it odd.
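As a sketch of that rule of thumb (the helper name below is made up, not a standard function):

# Rule-of-thumb sketch: start from sqrt(n) and force the result to be odd to avoid voting ties
import math

def suggest_k(n_samples):
    k = int(round(math.sqrt(n_samples)))
    return k if k % 2 == 1 else k + 1    # make it odd (subtracting 1 would work just as well)

print(suggest_k(100))   # sqrt(100) = 10 -> 11
print(suggest_k(150))   # sqrt(150) ~ 12 -> 13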
34. What is the difference between Euclidean distance and Manhattan distance? What are the formulas for Euclidean distance
and Manhattan distance?
Both are used to find the distance between two points. For two points A(x1, y1) and B(x2, y2), the Euclidean distance is
sqrt((x1-x2)^2 + (y1-y2)^2), i.e. the straight-line distance, while the Manhattan distance is |x1-x2| + |y1-y2|, i.e. the sum of
the absolute differences along each axis.
It should also be noted that these distance measures are only valid for continuous variables. In the case of categorical
variables, the Hamming distance must be used. This also brings up the issue of standardizing the numerical variables between
0 and 1 when there is a mixture of numerical and categorical variables in the dataset.
35. Why is KNN called a lazy learner?
When KNN gets the training data, it does not learn and make a model; it just stores the data. It does not derive any discriminative
function from the training data. It uses the training data only when it actually needs to make a prediction. So, KNN does not
immediately learn a model but delays the learning, which is why it is called a lazy learner.
36. Why should we not use the KNN algorithm for large datasets?
KNN works well with smaller datasets because it is a lazy learner: it needs to store all the data and only makes decisions at
run time. It needs to calculate the distance of a given point to all other points, so if the dataset is large, there will be a lot of
processing, which may adversely impact the performance of the algorithm.
KNN is also very sensitive to noise in the dataset. If the dataset is large, there is a greater chance of noise in the dataset, which
adversely affects the performance of the KNN algorithm.
Answer
With K=3, there are two Default=Y and one Default=N out of three closest neighbors. The prediction for the unknown case is
again Default=Y.
Using the standardized distance on the same training set, the unknown case returned a different neighbor which is not a good
sign of robustness.
Standardized Distance
One major drawback of calculating distance measures directly from the training set is the case where variables have different
measurement scales or there is a mixture of numerical and categorical variables. For example, if one variable is based on
annual income in dollars and the other is based on age in years, then income will have a much higher influence on the distance
calculated. One solution is to standardize the training set, as shown below.
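The original illustration is not reproduced here; a common form of this standardization is min-max scaling, Xs = (X - Xmin) / (Xmax - Xmin), sketched below on made-up income/age values:

# Min-max standardization so that income (dollars) and age (years) contribute comparably
import numpy as np

X = np.array([[48000.0, 33],
              [52000.0, 45],
              [61000.0, 29],
              [39000.0, 52]])              # made-up [income, age] rows

X_min, X_max = X.min(axis=0), X.max(axis=0)
X_std = (X - X_min) / (X_max - X_min)      # every column now lies in [0, 1]

print(X_std)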
39. Is there a need to standardize the data before applying KNN Algorithm?
Yes. As the complete algorithm is based on the concept of calculating distances, features on different scales can impact
the outcome, so it is always recommended to standardize the data before applying the algorithm.
40. Suppose we have a business requirement that we cannot convert or transform a categorical variable. Can we use KNN in
that case?
Yes, we could use KNN in such a case. We would need to change the default distance calculation method to the Hamming
distance.
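One way to make this concrete is a small, self-contained sketch of k-NN that uses Hamming distance directly on raw categorical values (the data and labels below are made up):

# k-NN with Hamming distance on raw categorical attributes (no numeric encoding needed)
from collections import Counter

def hamming(a, b):
    # number of attributes on which the two records disagree
    return sum(x != y for x, y in zip(a, b))

def knn_predict(train_X, train_y, query, k=3):
    dists = sorted(zip((hamming(x, query) for x in train_X), train_y))
    top_k = [label for _, label in dists[:k]]
    return Counter(top_k).most_common(1)[0][0]   # majority vote among the k nearest

train_X = [("red", "suv", "manual"), ("red", "sedan", "auto"),
           ("blue", "suv", "auto"), ("blue", "sedan", "manual")]
train_y = ["yes", "yes", "no", "no"]

print(knn_predict(train_X, train_y, ("red", "suv", "auto"), k=3))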
Instance-based algorithms are those algorithms that model the problem using data instances (or rows) in order to make
predictive decisions. The kNN algorithm is an extreme form of instance-based methods because all training observations are
retained as part of the model.
It is a competitive learning algorithm, because it internally uses competition between model elements (data instances) in order
to make a predictive decision. The objective similarity measure between data instances causes each data instance to compete
to “win” or be most similar to a given unseen data instance and contribute to a prediction.
Lazy learning refers to the fact that the algorithm does not build a model until the time that a prediction is required. It is lazy
because it only does work at the last second. This has the benefit of only including data relevant to the unseen data, called a
localized model. A disadvantage is that it can be computationally expensive to repeat the same or similar searches over larger
training datasets.
Finally, kNN is powerful because it does not assume anything about the data, other than that a distance measure can be
calculated consistently between any two instances. As such, it is called non-parametric or non-linear, as it does not assume a
functional form.
Handle Data: Open the dataset from CSV and split into test/train datasets.
Similarity: Calculate the distance between two data instances.
Neighbors: Locate k most similar data instances.
Response: Generate a response from a set of data instances.
Accuracy: Summarize the accuracy of predictions.
Main: Tie it all together.
Link: https://machinelearningmastery.com/tutorial-to-implement-k-nearest-neighbors-in-python-from-scratch/
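A condensed sketch of those steps is shown below; it is not the tutorial's exact code, and the CSV file name and column layout (feature columns followed by a class label column) are assumptions:

# From-scratch k-NN following the steps above (assumed CSV: feature columns, class label last)
import csv
import math
import random
from collections import Counter

def load_dataset(path, split=0.67):
    # Handle Data: read CSV rows and randomly split them into train/test sets
    train, test = [], []
    with open(path) as f:
        for row in csv.reader(f):
            if not row:
                continue
            parsed = [float(v) for v in row[:-1]] + [row[-1]]
            (train if random.random() < split else test).append(parsed)
    return train, test

def euclidean(a, b):
    # Similarity: distance over the feature columns (label column excluded)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a[:-1], b[:-1])))

def get_neighbors(train, query, k):
    # Neighbors: the k training rows closest to the query row
    return sorted(train, key=lambda row: euclidean(row, query))[:k]

def predict(train, query, k):
    # Response: majority label among the neighbors
    labels = [row[-1] for row in get_neighbors(train, query, k)]
    return Counter(labels).most_common(1)[0][0]

def accuracy(test, predictions):
    # Accuracy: fraction of test rows predicted correctly
    correct = sum(row[-1] == pred for row, pred in zip(test, predictions))
    return correct / len(test)

def main(path="iris.csv", k=3):
    # Main: tie it all together
    train, test = load_dataset(path)
    predictions = [predict(train, row, k) for row in test]
    print(f"Accuracy: {accuracy(test, predictions):.2%}")

if __name__ == "__main__":
    main()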
43. How do we select the optimal value of K?
The second step is to select the k value. This determines the number of neighbors we look at when we assign a value to any
new observation.
In our example, for a value k = 3, the closest points are ID1, ID5 and ID6.
ID11 = 69.66 kg
For the value of k=5, the closest points will be ID1, ID4, ID5, ID6 and ID10.
ID11 = 65.2 kg
We notice that based on the k value, the final result tends to change. Then how can we figure out the optimum value of k? Let
us decide it based on the error calculation for our train and validation set (after all, minimizing the error is our final goal!).
Have a look at the below graphs for training error and validation error for different values of k.
For a very low value of k (suppose k=1), the model overfits the training data, which leads to a high error rate on the
validation set. On the other hand, for a high value of k, the model performs poorly on both the train and validation sets. If you
observe closely, the validation error curve reaches a minimum at a value of k = 9. This value of k is the optimum value for the
model (it will vary for different datasets). This curve is known as an ‘elbow curve‘ (because it has a shape like an elbow) and is
usually used to determine the k value.
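The height/weight table from the example above is not reproduced here, but as a sketch, the loop below produces the train/validation error numbers behind such an elbow curve on a stand-in scikit-learn dataset:

# Train and validation error for a range of k values; pick the k where validation error bottoms out
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

for k in range(1, 26):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    train_error = 1 - knn.score(X_train, y_train)
    val_error = 1 - knn.score(X_val, y_val)
    print(k, round(train_error, 3), round(val_error, 3))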
44. What are potential problems with implementing kNN on a very large dataset?
One must understand what operations happen during each iteration of the algorithm. For each new data point, the kNN
classifier must:
Calculate the distances to all points in the training set and store them
Sort the calculated distances
Store the K nearest points
Calculate the proportions of each class
Assign the class with the highest proportion
Obviously this is a very taxing process, both in terms of time and space complexity. For a single new point, computing the
distances to all n training points is an O(n) pass (times the number of features), and sorting them is an O(n log n) process;
repeated for every new point, this becomes a monstrously long process indeed.
Another problem is memory, since all pairwise distances must be stored and sorted in memory on a machine. With very large
datasets, local machines will usually crash.
45. What are some ways of getting around the kNN-specific problems?
Source: https://www.researchgate.net/profile/Zhaolei_Zhang/publication/45855946/figure/fig7/AS:307384896507918@1450297678757/Figure-7-Two-dimesional-embedding-of-10000-MNIST-test-data-using-the-Deep-Neural.png
Solution #3: Random sampling to reduce training set size.
If you use a good random number generator and a decently large sample size, the sample should be a fairly good
representation of the original.
46. If using random sampling only once, and supposing we know a good k value to use for the original data, how
should k be adjusted in accordance with a change in the input size?
Answer:
Two important points must be clarified to tackle this problem:
What effect does sampling have on the kNN model?
What effect does changing k have on the kNN model?
Point #1: Effects of sampling:
Source: http://cs231n.github.io/assets/knn.jpeg
Notice from the comparison that:
The number of distinct regions (in terms of color) goes down when the k parameter increases.
The class boundaries of the predictions become more smooth as k increases.
What really is the significance of these effects? First, it gives hints that a lower k value makes the kNN model more “sensitive.”
That is, it is more sensitive to the local changes in the dataset. The “sensitivity” of the model directly translates to its variance.
All of these examples point to an inverse relationship between variance and k. Additionally, consider how kNN operates
when k reaches its maximum value, k = n (where n is the number of points in the training set). In this case, the majority class in
the training set will always dominate the predictions. It will simply pick the most abundant class in the data, and never deviate,
effectively resulting in zero variance. Therefore, it seems that to reduce variance, k must be increased.
Final Verdict: In order to offset the increased variance due to sampling, k can be increased to decrease model variance.
47. If not restricted to a single sample, what could be a fairly simple method to reduce increased variance of the model
other than changing k?
Answer:
If not restricted in the number of times one can draw samples from the original dataset, a simple variance reduction method
would be to sample many times, and then simply take a majority vote of the kNN models fit to each of these samples to classify
each test data point. This variance reduction method is called bagging. You might have heard of bagging, since it is the core
concept in random forest, a very popular tree ensemble method. We will explore this technique in greater detail in future posts.
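As a brief sketch of this idea, scikit-learn's BaggingClassifier can wrap a k-NN base estimator; the dataset, number of estimators and sample fraction below are illustrative:

# Bagging: fit k-NN models on bootstrap samples and majority-vote their predictions
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

bagged_knn = BaggingClassifier(KNeighborsClassifier(n_neighbors=5),
                               n_estimators=10,     # 10 samples, 10 k-NN models
                               max_samples=0.5,     # each model sees half of the data
                               random_state=0)

print(cross_val_score(bagged_knn, X, y, cv=5).mean())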
Non-parametric means not making any assumptions about the underlying data distribution. Non-parametric methods do not have
a fixed number of parameters in the model. Similarly, in KNN the number of model parameters actually grows with the training
data set - you can imagine each training case as a "parameter" in the model.
1. K-means is an unsupervised learning technique (no dependent variable), whereas KNN is a supervised learning
algorithm (a dependent variable exists)
2. K-means is a clustering technique which tries to split data points into K clusters such that the points in each cluster
tend to be near each other, whereas K-nearest neighbor tries to determine the classification of a point by combining the
classification of the K nearest points
Yes, K-nearest neighbor can be used for regression. In other words, the K-nearest neighbor algorithm can be applied when the
dependent variable is continuous. In this case, the predicted value is the average of the values of its k nearest neighbors.
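A minimal scikit-learn sketch of this (the height/weight numbers below are made up for illustration):

# k-NN regression: the prediction is the average of the k nearest neighbors' target values
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

heights = np.array([[5.0], [5.1], [5.6], [5.9], [4.8], [5.8], [5.3], [5.2], [5.5], [5.7]])
weights = np.array([45.0, 48.0, 60.0, 72.0, 40.0, 70.0, 52.0, 50.0, 58.0, 65.0])

knn_reg = KNeighborsRegressor(n_neighbors=3).fit(heights, weights)
print(knn_reg.predict([[5.5]]))   # mean weight of the 3 nearest heights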
1. Easy to understand
2. No assumptions about data
3. Can be applied to both classification and regression
4. Works easily on multi-class problems
Cons
Create dummy variables out of a categorical variable and include them instead of the original categorical variable. Unlike
regression, create k dummies instead of (k-1). For example, a categorical variable named "Department" has 5 unique levels /
categories, so we will create 5 dummy variables. Each dummy variable has 1 against its department and 0 otherwise.
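A quick pandas sketch of this (the department names are made up):

# Create k dummy variables (not k-1) from a categorical column before running k-NN
import pandas as pd

df = pd.DataFrame({"Department": ["HR", "Sales", "IT", "Finance", "Ops", "Sales"]})

dummies = pd.get_dummies(df["Department"])   # one indicator column per department (5 columns here)
print(dummies)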
Using cross-validation, our mean score is about 71.36%. This is a more accurate representation of how our model will perform
on unseen data than our earlier testing using the holdout method.
After training, we can check which of the values for ‘n_neighbors’ that we tested performed best. To do this, we will call
‘best_params_’ on our model.
#check top performing n_neighbors value
knn_gscv.best_params_
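The snippet above assumes a fitted GridSearchCV object named knn_gscv; one way it might have been set up is sketched below (the dataset and the range of n_neighbors values are illustrative):

# Sketch of how knn_gscv might have been created and fit
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

param_grid = {"n_neighbors": list(range(1, 26))}            # candidate k values
knn_gscv = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
knn_gscv.fit(X, y)

print(knn_gscv.best_params_)    # the n_neighbors value that performed best
print(knn_gscv.best_score_)     # its mean cross-validated accuracy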
56. What are the problems with training and testing on the same data?
Goal is to estimate likely performance of a model on out-of-sample data
But, maximizing training accuracy rewards overly complex models that won't necessarily generalize
Unnecessarily complex models overfit the training data
Image Credit: Overfitting by Chabacano. Licensed under GFDL via Wikimedia Commons.
Green line (decision boundary): overfit
Your accuracy would be high but may not generalize well for future observations
Your accuracy is high because it is perfect in classifying your training data but not out-of-sample data
Black line (decision boundary): just right
Good for generalizing for future observations
57. Give an example of how you would split the provided dataset into train and test sets. What do you accomplish by using it?
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=4)
test_size=0.4
40% of observations to the test set
60% of observations to the training set
Observations are assigned randomly; setting the random_state parameter makes the split reproducible
If you use random_state=4, your data will be split exactly the same way every time
58. Give some pointers on training and testing accuracy based on model complexity. What is the indicator of
complexity in the KNN algorithm?
Training accuracy rises as model complexity increases
Testing accuracy penalizes models that are too complex or not complex enough
For KNN models, complexity is determined by the value of K (lower value = more complex)