KNN Interview Question Rev 2.0

1) [True or False] The k-NN algorithm does more computation at test time than at train time.

A) TRUE
B) FALSE
Solution: A
The training phase of the algorithm consists only of storing the feature vectors and class labels of the training samples.
In the testing phase, a test point is classified by assigning the label that is most frequent among the k training samples
nearest to that query point – hence the higher computation at test time.

2) In the image below, which would be the best value for k, assuming that the algorithm you are using is k-Nearest Neighbors?

A) 3
B) 10
C) 20
D) 50

Solution: B
Validation error is lowest when the value of k is 10, so it is best to use this value of k.

3) Which of the following distance metrics cannot be used in k-NN?


A) Manhattan
B) Minkowski
C) Tanimoto
D) Jaccard
E) Mahalanobis
F) All can be used
Solution: F
All of these can be used as distance metrics for k-NN.

4) Which of the following options is true about the k-NN algorithm?


A) It can be used for classification
B) It can be used for regression
C) It can be used in both classification and regression
Solution: C
We can also use k-NN for regression problems. In this case the prediction can be based on the mean or the median of the
k most similar instances.
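As a rough illustration of option C, here is a small scikit-learn sketch (not part of the original question set); the tiny arrays and the choice of k=3 are made up for demonstration only.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_class = np.array([0, 0, 1, 1, 1])            # labels for classification
y_value = np.array([1.2, 1.9, 3.1, 3.9, 5.2])  # targets for regression

clf = KNeighborsClassifier(n_neighbors=3).fit(X, y_class)
reg = KNeighborsRegressor(n_neighbors=3).fit(X, y_value)

print(clf.predict([[2.5]]))  # majority vote of the 3 nearest labels
print(reg.predict([[2.5]]))  # mean of the 3 nearest target values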

5) Which of the following statements is/are true about the k-NN algorithm?


1. k-NN performs much better if all of the data have the same scale
2. k-NN works well with a small number of input variables (p), but struggles when the number of inputs is very large
3. k-NN makes no assumptions about the functional form of the problem being solved
A) 1 and 2
B) 1 and 3
C) Only 1
D) All of the above
Solution: D
All of the above statements are true properties of the k-NN algorithm.

6) Which of the following machine learning algorithm can be used for imputing missing values of both categorical and
continuous variables?
A) K-NN
B) Linear Regression
C) Logistic Regression
Solution: A
The k-NN algorithm can be used for imputing missing values of both categorical and continuous variables.
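A hedged sketch of such imputation using scikit-learn's KNNImputer is shown below; note that KNNImputer works on numeric arrays, so categorical columns would first need to be encoded numerically. The small array and n_neighbors=2 are illustrative choices.

import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [3.0, np.nan],
              [5.0, 6.0],
              [7.0, 8.0]])

imputer = KNNImputer(n_neighbors=2)  # each missing value is replaced using its 2 nearest rows
print(imputer.fit_transform(X))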

7) Which of the following is true about Manhattan distance?


A) It can be used for continuous variables
B) It can be used for categorical variables
C) It can be used for categorical as well as continuous
D) None of these
Solution: A
Manhattan distance is designed for calculating the distance between real-valued (continuous) features.

8) Which of the following distance measures do we use in the case of categorical variables in k-NN?
1. Hamming Distance
2. Euclidean Distance
3. Manhattan Distance
A) 1
B) 2
C) 3
D) 1 and 2
E) 2 and 3
F) 1,2 and 3
Solution: A
Both Euclidean and Manhattan distances are used in the case of continuous variables, whereas Hamming distance is used in the case
of categorical variables.

9) Which of the following will be the Euclidean Distance between the two data points A(1,3) and B(2,3)?
A) 1
B) 2
C) 4
D) 8
Solution: A
sqrt( (1-2)^2 + (3-3)^2) = sqrt(1^2 + 0^2) = 1

10) Which of the following will be the Manhattan Distance between the two data points A(1,3) and B(2,3)?
A) 1
B) 2
C) 4
D) 8
Solution: A
|1-2| + |3-3| = 1 + 0 = 1
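A quick check of questions 9 and 10, assuming SciPy is available (the function names below are SciPy's, not from the original text):

from scipy.spatial import distance

A, B = (1, 3), (2, 3)
print(distance.euclidean(A, B))  # sqrt((1-2)^2 + (3-3)^2) = 1.0
print(distance.cityblock(A, B))  # |1-2| + |3-3| = 1 (Manhattan distance)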

Context: 11-12
Suppose you are given the following data, where x and y are the two input variables and Class is the dependent variable.

Below is a scatter plot which shows the above data in 2D space.


11) Suppose you want to predict the class of the new data point x=1 and y=1 using Euclidean distance in 3-NN. To which class does this
data point belong?

A) + Class
B) – Class
C) Can’t say
D) None of these
Solution: A
All three nearest points are of the + class, so this point will be classified as + class.

12) In the previous question, you now want to use 7-NN instead of 3-NN. To which class will the point x=1 and y=1 belong?

A) + Class
B) – Class
C) Can’t say
Solution: B
Now this point will be classified as – class because there are 4 – class and 3 + class points in the nearest circle.

Context 13-14:
Suppose you are given the following 2-class data, where "+" represents the positive class and "–" represents the negative class.

13) Which of the following values of k in k-NN would minimize the leave-one-out cross-validation error?
A) 3
B) 5
C) Both have same
D) None of these
Solution: B
5-NN will have the least leave-one-out cross-validation error.

14) Which of the following would be the leave-one-out cross-validation accuracy for k=5?
A) 2/14
B) 4/14
C) 6/14
D) 8/14
E) None of the above
Solution: E
In 5-NN we will have 10/14 leave one out cross validation accuracy.

15) Which of the following will be true about k in k-NN in terms of Bias?
A) When you increase k, the bias increases
B) When you decrease k, the bias increases
C) Can’t say
D) None of these
Solution: A
A large k yields a simpler model, and a simpler model is generally considered to have higher bias.

16) Which of the following will be true about k in k-NN in terms of variance?
A) When you increase k, the variance increases
B) When you decrease k, the variance increases
C) Can’t say
D) None of these
Solution: B
A simpler model (larger k) is generally considered to have lower variance, so decreasing k increases the variance.

17) The following two distances (Euclidean distance and Manhattan distance), which we generally use in the k-NN algorithm, are shown
between two points A(x1, y1) and B(x2, y2).
Your task is to identify each distance from the two graphs below. Which of the following options is true about the graphs?

A) Left is Manhattan Distance and right is Euclidean Distance


B) Left is Euclidean Distance and right is Manhattan Distance
C) Neither left nor right is Manhattan Distance
D) Neither left nor right is Euclidean Distance
Solution: B
Left is the graphical depiction of how Euclidean distance works, whereas the right one is of Manhattan distance.

18) When you find noise in data which of the following option would you consider in k-NN?
A) I will increase the value of k
B) I will decrease the value of k
C) Noise does not depend on the value of k
D) None of these
Solution: A
To be more sure of which classifications you make, you can try increasing the value of k.

19) In k-NN it is very likely to overfit due to the curse of dimensionality. Which of the following options would you consider to
handle such a problem?
1. Dimensionality Reduction
2. Feature selection
A) 1
B) 2
C) 1 and 2
D) None of these
Solution: C
In such a case you can use either a dimensionality reduction algorithm or a feature selection algorithm, as sketched below.
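A minimal sketch of both remedies with scikit-learn, assuming the digits dataset purely for illustration; the numbers of components/features (10) and k=5 are arbitrary choices, not recommendations from the text.

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)  # 64 pixel features

# 1. dimensionality reduction (PCA) before k-NN
pca_knn = make_pipeline(PCA(n_components=10), KNeighborsClassifier(n_neighbors=5))
# 2. feature selection (univariate) before k-NN
sel_knn = make_pipeline(SelectKBest(f_classif, k=10), KNeighborsClassifier(n_neighbors=5))

print(cross_val_score(pca_knn, X, y, cv=5).mean())
print(cross_val_score(sel_knn, X, y, cv=5).mean())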

20) Below are two statements. Which of them is/are true?
1. k-NN is a memory-based approach, so the classifier immediately adapts as we collect new training data.
2. The computational complexity for classifying new samples grows linearly with the number of samples in the training
dataset in the worst-case scenario.
A) 1
B) 2
C) 1 and 2
D) None of these
Solution: C
Both statements are true.

21) Suppose you are given the following images (1 left, 2 middle and 3 right). Your task is to compare the value of k used in
k-NN in each image, where k1 is for the 1st, k2 is for the 2nd and k3 is for the 3rd figure.

A) k1 > k2> k3
B) k1<k2
C) k1 = k2 = k3
D) None of these
Solution: D
The value of k is highest in k3, whereas in k1 it is lowest.

22) Which of the following values of k in the following graph would give the least leave-one-out cross-validation accuracy?

A) 1
B) 2
C) 3
D) 5
Solution: B
If you keep the value of k as 2, it gives the lowest cross validation accuracy. You can try this out yourself.

23) A company has built a kNN classifier that gets 100% accuracy on training data. When they deployed this model on the client
side, it was found that the model is not at all accurate. Which of the following might have gone wrong?
Note: The model has been successfully deployed and no technical issues were found at the client side except the model performance

A) It is probably an overfitted model


B) It is probably an underfitted model
C) Can’t say
D) None of these
Solution: A
An overfitted model appears to perform well on training data, but it is not generalized enough to give the same
results on new data.

24) You are given the following 2 statements. Which of these options is/are true in the case of k-NN?
1. In case of a very large value of k, we may include points from other classes in the neighborhood.
2. In case of a too small value of k, the algorithm is very sensitive to noise
A) 1
B) 2
C) 1 and 2
D) None of these
Solution: C
Both the options are true and are self explanatory.

25) Which of the following statements is true for k-NN classifiers?


A) The classification accuracy is better with larger values of k
B) The decision boundary is smoother with smaller values of k
C) The decision boundary is linear
D) k-NN does not require an explicit training step
Solution: D
Option A: This is not always true. You have to ensure that the value of k is not too high or not too low.
Option B: This statement is not true. The decision boundary can be a bit jagged
Option C: Same as option B
Option D: This statement is true

26) True-False: It is possible to construct a 2-NN classifier by using the 1-NN classifier?
A) TRUE
B) FALSE
Solution: A
You can implement a 2-NN classifier by ensembling 1-NN classifiers

27) In k-NN what will happen when you increase/decrease the value of k?
A) The boundary becomes smoother with increasing value of K
B) The boundary becomes smoother with decreasing value of K
C) Smoothness of the boundary doesn't depend on the value of K
D) None of these
Solution: A
The decision boundary would become smoother by increasing the value of K

28) Following are two statements given for the k-NN algorithm. Which of the statement(s)
is/are true?
1. We can choose optimal value of k with the help of cross validation
2. Euclidean distance treats each feature as equally important
A) 1
B) 2
C) 1 and 2
D) None of these
Solution: C
Both the statements are true

Context 29-30:
Suppose you have trained a k-NN model and now you want to get predictions on test data. Before getting the predictions,
suppose you want to calculate the time taken by k-NN for predicting the class for the test data.
Note: Calculating the distance between 2 observations takes D time.
29) What would be the time taken by 1-NN if there are N(Very large) observations in test data?
A) N*D
B) N*D*2
C) (N*D)/2
D) None of these
Solution: A
The value of N is very large, so option A is correct

30) What would be the relation between the time taken by 1-NN,2-NN,3-NN.
A) 1-NN >2-NN >3-NN
B) 1-NN < 2-NN < 3-NN
C) 1-NN ~ 2-NN ~ 3-NN
D) None of these
Solution: C
The prediction time is dominated by computing the distances to all training observations, which is the same for any value of k, so 1-NN, 2-NN and 3-NN take essentially the same time.


31. What is “K” in KNN algorithm?

K = Number of nearest neighbors you want to select to predict the class of a given item

32. How do we decide the value of "K" in KNN algorithm?

If K is small, then the results might not be reliable because noise will have a higher influence on the result. If K is large, then there
will be a lot of processing, which may adversely impact the performance of the algorithm. So, the following must be considered
while choosing the value of K:

a. A common rule of thumb is to set K to roughly the square root of n (the number of data points in the training dataset)


b. K should be odd so that there are no ties. If square root is even, then add or subtract 1 to it.


33. Why is the odd value of “K” preferable in KNN algorithm?

K should be odd so that there are no ties in the voting. If the square root of the number of data points is even, then add or subtract 1
to make it odd.

34. What is the difference between Euclidean Distance and Manhattan distance? What is the formula of Euclidean distance and
Manhattan distance?

Both are used to find the distance between two points.
Euclidean distance: d(A, B) = sqrt( (x1-x2)^2 + (y1-y2)^2 ), i.e. the length of the straight line between the points.
Manhattan distance: d(A, B) = |x1-x2| + |y1-y2|, i.e. the sum of the absolute differences along each axis.

It should also be noted that these distance measures are only valid for continuous variables. For categorical
variables the Hamming distance must be used. This also brings up the issue of standardizing the numerical variables between
0 and 1 when there is a mixture of numerical and categorical variables in the dataset.

35. Why is KNN algorithm called Lazy Learner?

When it gets the training data, it does not learn and build a model; it just stores the data. It does not derive any discriminative
function from the training data. It uses the training data only when it actually needs to make a prediction. So, KNN does not
immediately learn a model, but delays the learning, which is why it is called a lazy learner.

36. Why should we not use KNN algorithm for large datasets?

KNN works well with smaller datasets because it is a lazy learner. It needs to store all the data and then makes decisions only at
run time. It needs to calculate the distance from a given point to all other points. So if the dataset is large, there will be a lot of
processing, which may adversely impact the performance of the algorithm.

KNN is also very sensitive to noise in the dataset. If the dataset is large, there are higher chances of noise in the dataset, which
adversely affects the performance of the KNN algorithm.

37. What are the advantages and disadvantages of KNN algorithm?

Answer: See the pros and cons listed under question 51 below.

38. Explain KNN with an example


Consider the following data concerning credit default. Age and Loan are two numerical variables (predictors) and Default is the
target.
We can now use the training set to classify an unknown case (Age=48 and Loan=$142,000) using Euclidean distance. If K=1
then the nearest neighbor is the last case in the training set with Default=Y.

D = Sqrt[(48-33)^2 + (142000-150000)^2] = 8000.01 >> Default=Y

With K=3, there are two Default=Y and one Default=N out of three closest neighbors. The prediction for the unknown case is
again Default=Y.
Using the standardized distance on the same training set, the unknown case returned a different neighbor which is not a good
sign of robustness.

Standardized Distance
One major drawback in calculating distance measures directly from the training set is in the case where variables have different
measurement scales or there is a mixture of numerical and categorical variables. For example, if one variable is based on
annual income in dollars, and the other is based on age in years then income will have a much higher influence on the distance
calculated. One solution is to standardize the training set as shown below.
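Since the standardized table itself is not reproduced here, below is a minimal sketch of the idea using min-max scaling, X_s = (X - min) / (max - min); the Age and Loan ranges are assumed placeholder values, not the ones from the original table.

import numpy as np

def minmax(x, lo, hi):
    return (x - lo) / (hi - lo)

# assumed (placeholder) ranges of the training data
AGE_MIN, AGE_MAX = 20, 60
LOAN_MIN, LOAN_MAX = 40_000, 220_000

query = np.array([minmax(48, AGE_MIN, AGE_MAX), minmax(142_000, LOAN_MIN, LOAN_MAX)])
case = np.array([minmax(33, AGE_MIN, AGE_MAX), minmax(150_000, LOAN_MIN, LOAN_MAX)])

# after scaling, Age and Loan contribute to the distance on comparable scales
print(np.sqrt(((query - case) ** 2).sum()))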

39. Is there a need to standardize the data before applying KNN Algorithm?
Yes. As the complete algorithm is based on calculating distances, features on different scales can skew
the outcome, so it is always recommended to standardize the data before applying the algorithm.
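A hedged sketch of doing this the usual scikit-learn way, wrapping the scaler and the classifier in a single pipeline so the scaler is fit only on the training folds; the wine dataset and k=5 are illustrative choices.

from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)  # features on very different scales

raw_knn = KNeighborsClassifier(n_neighbors=5)
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

print(cross_val_score(raw_knn, X, y, cv=5).mean())     # without scaling
print(cross_val_score(scaled_knn, X, y, cv=5).mean())  # with scaling, usually noticeably higher here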

40. Suppose we have a business requirement that we cannot convert or transform a categorical variable. Can we use KNN in
that case?
Yes. We could use KNN in such a case. We would need to change the default distance calculation method to Hamming
distance.
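A hedged sketch of this with scikit-learn: with the brute-force algorithm the classifier accepts metric="hamming" on label-encoded categorical features. The tiny integer-coded matrix below is made up for illustration.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# each column is a categorical feature encoded as integer codes
X = np.array([[0, 1, 2],
              [0, 1, 0],
              [1, 0, 2],
              [1, 0, 1]])
y = np.array([0, 0, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3, metric="hamming", algorithm="brute")
knn.fit(X, y)
print(knn.predict([[0, 1, 1]]))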

41. Briefly explain how does KNN works in background?


The kNN algorithm belongs to the family of instance-based, competitive learning and lazy learning algorithms.

Instance-based algorithms are those algorithms that model the problem using data instances (or rows) in order to make
predictive decisions. The kNN algorithm is an extreme form of instance-based methods because all training observations are
retained as part of the model.

It is a competitive learning algorithm, because it internally uses competition between model elements (data instances) in order
to make a predictive decision. The objective similarity measure between data instances causes each data instance to compete
to “win” or be most similar to a given unseen data instance and contribute to a prediction.

Lazy learning refers to the fact that the algorithm does not build a model until the time that a prediction is required. It is lazy
because it only does work at the last second. This has the benefit of only including data relevant to the unseen data, called a
localized model. A disadvantage is that it can be computationally expensive to repeat the same or similar searches over larger
training datasets.

Finally, kNN is powerful because it does not assume anything about the data, other than a distance measure can be calculated
consistently between any two instances. As such, it is called non-parametric or non-linear as it does not assume a functional
form.

42. What are the steps to implement KNN?


This tutorial is broken down into the following steps:

 Handle Data: Open the dataset from CSV and split into test/train datasets.
 Similarity: Calculate the distance between two data instances.
 Neighbors: Locate k most similar data instances.
 Response: Generate a response from a set of data instances.
 Accuracy: Summarize the accuracy of predictions.
 Main: Tie it all together.
Link :- https://machinelearningmastery.com/tutorial-to-implement-k-nearest-neighbors-in-
python-from-scratch/
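As a rough companion to the steps above (and not the code from the linked tutorial), here is a minimal from-scratch sketch covering the distance, neighbors and response steps; the toy training points are made up.

import math
from collections import Counter

def euclidean(a, b):
    # Similarity step: Euclidean distance between two instances
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k=3):
    # Neighbors step: training row indices sorted by distance to the query
    order = sorted(range(len(train_X)), key=lambda i: euclidean(train_X[i], query))
    top_k = [train_y[i] for i in order[:k]]
    # Response step: majority vote among the k nearest labels
    return Counter(top_k).most_common(1)[0][0]

train_X = [[1, 1], [2, 2], [8, 8], [9, 9]]
train_y = ["A", "A", "B", "B"]
print(knn_predict(train_X, train_y, [1.5, 1.5], k=3))  # -> "A"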
43. How do we select the optimal value of K?
The next step is to select the k value. This determines the number of neighbors we look at when we assign a value to any
new observation.
In our example, for a value k = 3, the closest points are ID1, ID5 and ID6.

The prediction of weight for ID11 will be:


ID11 = (77+72+60)/3

ID11 = 69.66 kg
For the value of k=5, the closest point will be ID1, ID4, ID5, ID6, ID10.

The prediction for ID11 will be :


ID 11 = (77+59+72+60+58)/5

ID 11 = 65.2 kg
We notice that based on the k value, the final result tends to change. Then how can we figure out the optimum value of k? Let
us decide it based on the error calculation for our training and validation sets (after all, minimizing the error is our final goal!).
Have a look at the graphs below for training error and validation error for different values of k.
For a very low value of k (say k=1), the model overfits the training data, which leads to a high error rate on the
validation set. On the other hand, for a high value of k, the model performs poorly on both the training and validation sets. If you
observe closely, the validation error curve reaches a minimum at a value of k = 9. This value of k is the optimum value for the
model (it will vary for different datasets). This curve is known as an 'elbow curve' (because it has a shape like an elbow) and is
usually used to determine the k value.
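A hedged sketch of the elbow-curve idea in code: compute the validation error for a range of k and pick the k with the lowest error. The dataset, the split and the range 1 to 25 are illustrative assumptions, and the best k will differ for other data.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

errors = {}
for k in range(1, 26):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    errors[k] = 1 - knn.score(X_val, y_val)  # validation error for this k

best_k = min(errors, key=errors.get)
print(best_k, errors[best_k])  # the k at the bottom of the elbow curve for this split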

44. What are potential problems with implementing kNN on a very large dataset?
One must understand what operations happen during each iteration of the algorithm. For each new data point, the kNN
classifier must:
 Calculate the distances to all points in the training set and store them
 Sort the calculated distances
 Store the K nearest points
 Calculate the proportions of each class
 Assign the class with the highest proportion
Obviously this is a very taxing process, both in terms of time and space complexity. For a single query point, computing the
distances to all n training points takes time linear in n, and sorting them is an O(n log n) operation; repeating this for every new
point makes prediction expensive, and the cost keeps growing with the size of the training set.
Another problem is memory, since all of the computed distances must be stored and sorted in memory on a single machine. With very
large datasets, local machines will usually run out of memory.

45. What are some ways of getting around the kNN-specific problems?

Solution #1: Get more resources (computing power or larger memory).


This is obviously not the best answer to a scalability question, and not really applicable in real-life, industry problems.
Solution #2: Preprocessing the data.
Use dimensionality reduction (via PCA (principal component analysis) or feature selection) to reduce the complexity of the distance
calculation. You can also use clustering algorithms (like K-means or Rocchio) to reduce the number of points used to compute
distances and to sort, as illustrated below. In this case, the nontrivial task becomes assigning the test set point to the correct
cluster.

Source: https://www.researchgate.net/profile/Zhaolei_Zhang/publication/45855946/figure/fig7/AS:307384896507918@1450297678757/Figure-7-Two-dimesional-embedding-of-10000-MNIST-test-data-using-the-Deep-Neural.png
Solution #3: Random sampling to reduce training set size.
If you use a good random number generator and a decently large sample size, the sample should be a fairly good
representation of the original.
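A minimal sketch of Solution #3 (random sampling of the training set) with scikit-learn; the digits dataset and the 20% sample size are illustrative assumptions.

import numpy as np
from sklearn.datasets import load_digits
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)

rng = np.random.default_rng(0)
idx = rng.choice(len(X), size=len(X) // 5, replace=False)  # keep a 20% random sample

knn_sampled = KNeighborsClassifier(n_neighbors=5).fit(X[idx], y[idx])
print(knn_sampled.score(X, y))  # cheaper to query; accuracy usually drops only slightly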

46. If using random sampling only once, and supposing we know a good k value to use for the original data, how
should k be adjusted in accordance with the change in input size?

Answer:
Two important points must be clarified to tackle this problem:
 What effect does sampling have on the kNN model?
 What effect does changing k have on the kNN model?
Point #1: Effects of sampling:

ggplot2 comparison of a sample 20% the size of the original dataset (iris)


As illustrated above, sampling does several things from the perspective of a single data point, since kNN works on a point-by-point
basis.
 The average distance to the k nearest neighbors increases due to increased sparsity in the dataset.
 Consequently, the area covered by k-nearest neighbors increases in size and covers a larger area of the feature
space.
 The sample variance increases.
A consequence of this change in input is an increase in variance. When we talk of variance, we refer to the variability in the
predictions given different samples from the population. Why would the immediate effects of sampling lead to increased
variance of the model?
Notice that now a larger area of the feature space is represented by the same k data points. While our sample size has not
grown, the population space that it represents has increased in size. This will result in higher variance in the proportion of
classes in the k nearest data points, and consequently a higher variance in the classification of each data point.
Point #2: Effects of Changing the k Parameter in kNN
Let us first examine the visual changes of changing k from k=1 to k=5 on a particular dataset.

Source: http://cs231n.github.io/assets/knn.jpeg
Notice from the comparison that:
 The number of distinct regions (in terms of color) goes down when the k parameter increases.
 The class boundaries of the predictions become more smooth as k increases.
What really is the significance of these effects? First, it gives hints that a lower k value makes the kNN model more “sensitive.”
That is, it is more sensitive to the local changes in the dataset. The “sensitivity” of the model directly translates to its variance.
All of these examples point to an inverse relationship between variance and k. Additionally, consider how kNN operates
when k reaches its maximum value, k=n (where n is the number of points in the training set). In this case, the majority class in
the training set will always dominate the predictions. It will simply pick the most abundant class in the data, and never deviate,
effectively resulting in zero variance. Therefore, to reduce variance, k must be increased.
Final Verdict: In order to offset the increased variance due to sampling, k can be increased to decrease model variance.
47. If not restricted to a single sample, what could be a fairly simple method to reduce increased variance of the model
other than changing k?

Answer:
If we are not restricted in the number of times we can draw samples from the original dataset, a simple variance reduction method
would be to sample many times, and then simply take a majority vote of the kNN models fit to each of these samples to classify
each test data point. This variance reduction method is called bagging. You might have heard of bagging, since it is the core
concept in random forest, a very popular tree ensemble method.

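A hedged sketch of bagging k-NN models with scikit-learn's BaggingClassifier; the dataset, the number of estimators and the sample fraction are illustrative choices (older scikit-learn versions name the first argument base_estimator instead of estimator).

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

bagged_knn = BaggingClassifier(
    estimator=KNeighborsClassifier(n_neighbors=5),
    n_estimators=25,   # number of k-NN models, each fit on its own sample
    max_samples=0.5,   # each model sees half of the training data
    random_state=0,
)
print(cross_val_score(bagged_knn, X, y, cv=5).mean())  # final class is a majority vote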

48. Why KNN is non-parametric?

Non-parametric means not making any assumptions about the underlying data distribution. Non-parametric methods do not have
a fixed number of parameters in the model. Similarly, in KNN the number of model parameters actually grows with the training data
set - you can imagine each training case as a "parameter" in the model.

49. KNN vs. K-means


Many people get confused between these two techniques: K-means and K-nearest neighbors. See some of the
differences below -

1. K-means is an unsupervised learning technique (no dependent variable) whereas KNN is a supervised learning
algorithm (a dependent variable exists)
2. K-means is a clustering technique which tries to split data points into K clusters such that the points in each cluster
tend to be near each other, whereas K-nearest neighbors tries to determine the classification of a point by combining the
classifications of the K nearest points

50. Can KNN be used for regression?

Yes, K-nearest neighbor can be used for regression. In other words, K-nearest neighbor algorithm can be applied when
dependent variable is continuous. In this case, the predicted value is the average of the values of its k nearest neighbors.

51. Discuss some pros and cons of KNN Algorithm?


Pros

1. Easy to understand
2. No assumptions about data
3. Can be applied to both classification and regression
4. Works easily on multi-class problems

Cons

1. Memory Intensive / Computationally expensive


2. Sensitive to scale of data
3. Does not work well on rare-event (skewed) target variables
4. Struggles when there is a high number of independent variables
For any given problem, a small value of k will lead to a large variance in predictions. Alternatively, setting k to a large value
may lead to a large model bias.

52. How to handle categorical variables in KNN?

Create dummy variables out of a categorical variable and include them instead of the original categorical variable. Unlike
regression, create k dummies instead of (k-1). For example, a categorical variable named "Department" has 5 unique levels /
categories, so we will create 5 dummy variables. Each dummy variable has 1 against its department and 0 otherwise.
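A small pandas sketch of this (the "Department" values below are made up); note that pd.get_dummies keeps all k columns by default, which is what we want here.

import pandas as pd

df = pd.DataFrame({"Department": ["HR", "IT", "Sales", "IT", "Finance"]})
dummies = pd.get_dummies(df["Department"])  # one 0/1 column per unique level (k columns, not k-1)
print(dummies)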

53. How does K-Fold Cross Validation work?


Cross-validation is when the dataset is randomly split up into 'k' groups. One of the groups is used as the test set and the rest
are used as the training set. The model is trained on the training set and scored on the test set. Then the process is repeated
until each unique group has been used as the test set.
For example, for 5-fold cross validation, the dataset would be split into 5 groups, and the model would be trained and tested 5
separate times so each group would get a chance to be the test set. This can be seen in the graph below.

5-fold cross validation (image credit)


The train-test-split method we used in earlier is called ‘holdout’. Cross-validation is better than using the holdout method
because the holdout method score is dependent on how the data is split into train and test sets. Cross-validation gives the
model an opportunity to test on multiple splits so we can get a better idea on how the model will perform on unseen data.
In order to train and test our model using cross-validation, we will use the 'cross_val_score' function with a cross-validation
value of 5. 'cross_val_score' takes in our k-NN model and our data as parameters. Then it splits our data into 5 groups and fits
and scores our data 5 separate times, recording the accuracy score in an array each time. We will save the accuracy scores in
the 'cv_scores' variable.
To find the average of the 5 scores, we will use numpy's mean function, passing in 'cv_scores'. Numpy is a useful math library in
Python.
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
import numpy as np

# create a new KNN model
knn_cv = KNeighborsClassifier(n_neighbors=3)
# train the model with 5-fold cross-validation (X and y are the features and target prepared earlier)
cv_scores = cross_val_score(knn_cv, X, y, cv=5)
# print each cv score (accuracy) and average them
print(cv_scores)
print('cv_scores mean:{}'.format(np.mean(cv_scores)))

Using cross-validation, our mean score is about 71.36%. This is a more accurate representation of how our model will perform
on unseen data than our earlier testing using the holdout method.

54. How can you optimize the KNN Model?


We could do the hyperparameter tuning using GridSearchCV.
Hyperparameter tuning is when you go through a process to find the optimal parameters for your model to improve accuracy.
In our case, we will use GridSearchCV to find the optimal value for ‘n_neighbors’.
GridSearchCV works by training our model multiple times on a range of parameters that we specify. That way, we can test our
model with each parameter and figure out the optimal values to get the best accuracy results.
For our model, we will specify a range of values for ‘n_neighbors’ in order to see which value works best for our model. To do
this, we will create a dictionary, setting ‘n_neighbors’ as the key and using numpy to create an array of values from 1 to 24.
Our new model using grid search will take in a new k-NN classifier, our param_grid and a cross-validation value of 5 in order to
find the optimal value for ‘n_neighbors’.
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
import numpy as np

knn2 = KNeighborsClassifier()                    # create a new knn model
param_grid = {'n_neighbors': np.arange(1, 25)}   # dictionary of all values we want to test for n_neighbors
knn_gscv = GridSearchCV(knn2, param_grid, cv=5)  # use grid search to test all values for n_neighbors
knn_gscv.fit(X, y)                               # fit model to the data prepared earlier

After training, we can check which of our values for ‘n_neighbors’ that we tested performed the best. To do this, we will call
‘best_params_’ on our model.
#check top performing n_neighbors value
knn_gscv.best_params_

55. Explain KNN Algorithm in brief?


 KNN model
1. Pick a value for K.
2. Search for the K observations in the training data that are "nearest" to the measurements of the unknown
iris
3. Use the most popular response value from the K nearest neighbors as the predicted response value for the
unknown iris
 Note: if we evaluated a 1-NN model on its own training data, it would always have 100% accuracy, because we are testing on the
exact same data, so it would always make correct predictions
 KNN would search for the one nearest observation and find that exact same observation
 KNN has memorized the training set
 Because we are testing on the exact same data, it would always make the same prediction

56. What are the problems with training and testing on the same data?
 Goal is to estimate likely performance of a model on out-of-sample data
 But, maximizing training accuracy rewards overly complex models that won't necessarily generalize
 Unnecessarily complex models overfit the training data

Image Credit: Overfitting by Chabacano. Licensed under GFDL via Wikimedia Commons.
 Green line (decision boundary): overfit
 Your accuracy would be high but may not generalize well for future observations
 Your accuracy is high because it is perfect in classifying your training data but not out-of-sample data
 Black line (decision boundary): just right
 Good for generalizing for future observations
57. Give an example of how do you split the dataset provided in to train test? What do you accomplish by using it?
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=4)
 test_size=0.4
 40% of observations to test set
 60% of observations to training set
 data is randomly assigned unless you use random_state hyperparameter
 If you use random_state=4
o Your data will be split exactly the same way

What did this accomplish?


 Model can be trained and tested on different data
 Response values are known for the testing set, and thus predictions can be evaluated
 Testing accuracy is a better estimate than training accuracy of out-of-sample performance
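A minimal follow-on sketch (not from the original notebook): fit k-NN on the training split and score it on the held-out test split; the iris dataset and k=5 are assumed for illustration.

from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=4)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print(accuracy_score(y_test, knn.predict(X_test)))  # testing accuracy on unseen data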

58. Give some pointers on training and testing accuracy based on model complexity? What is the indicator of
complexity in KNN Algorithm?
 Training accuracy rises as model complexity increases
 Testing accuracy penalizes models that are too complex or not complex enough
 For KNN models, complexity is determined by the value of K (lower value = more complex)

59. What are the downsides of train test split?


 Provides a high-variance estimate of out-of-sample accuracy
 K-fold cross-validation overcomes this limitation
 But, train/test split is still useful because of its flexibility and speed

60. What are the accuracy measures for KNN?


1. The first step is to identify the correct value of K (we can use GridSearchCV or RandomizedSearchCV for hyperparameter
tuning).
2. Once we have finalized the value of K, we can use the confusion matrix, TPR, FPR, Sensitivity and Specificity to assess the
accuracy of the model.

61. What do you understand by the terms precision and recall?


Recall is alternatively called the true positive rate. It refers to the number of positives that your model has correctly identified
compared to the number of positives that are actually present in the data.
Precision, which is alternatively called the positive predictive value, is based on the predictions. It is a measurement of the number
of correct positives that the model has claimed compared to the total number of positives that the model has claimed.
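A quick sketch of computing both with scikit-learn on made-up labels:

from sklearn.metrics import precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(precision_score(y_true, y_pred))  # correct positives / all predicted positives = 3/4
print(recall_score(y_true, y_pred))     # correct positives / all actual positives = 3/4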

62. Why is choosing the right K value critical?


 Overly large k values lead to imprecise models which could lead to many misclassified points.
 Small k values lead to overly precise models that may fail to see the forest for the trees. The jagged and convoluted
boundaries between classes reduce the interpretability of the model and lead to potentially incorrect classifications.

63. What is the no free lunch theorem in ML world?


None of the algorithms score perfectly in all of the criteria, although a case can be made for using each of them. It all depends
on the data scientist’s priorities and the specific circumstances in which the problem has to be solved.
