Interview Questions
Ans:-Here we plot each data point in n-dimensional space (where n is the number of features),
with the value of each feature being the value of a particular coordinate. Then we perform
classification by finding the hyper-plane that best separates the classes.
3 What are the tuning parameters of SVM?
Ans:-Kernel, Regularization, Gamma and Margin are the tuning parameters of SVM
4 Explain Kernel in SVM?
Ans:-Kernels are transformations applied to the input variables that map data which is not
linearly separable into a space where it becomes separable. There are 9 different kernels
available (for example, in the kernlab package). Examples are Linear, RBF, Polynomial, etc.
5 Is there a need to convert categorical variables into numeric in SVM? If yes, explain.
Ans:-Yes. All categorical variables have to be converted to numeric by creating dummy
variables, because every data point has to be plotted in n-dimensional space. In addition,
the tuning parameters (Kernel, Regularization, Gamma & Margin) involve mathematical
computations that require numeric variables. Numeric input is therefore a requirement of SVM.
6 What is Regularization in SVM?
Ans:-The value of the Regularization parameter tells the model how strongly it should avoid
misclassifying each training observation. Larger values penalise misclassification more
heavily and yield a narrower margin.
7 What is the Gamma parameter in SVM?
Ans:-Gamma is the kernel coefficient in the RBF, Polynomial & Sigmoid kernels. Higher
values of Gamma make the model more complex and more likely to overfit.
8 What is the package used for SVM in R?
Ans:-kernlab is a package used for implementing SVM in R.
9 What is the function name to implement SVM in R?
Ans:-ksvm is the function in kernlab that implements SVM.
10 What is a decision tree?
Ans:-Decision Tree is a supervised machine learning algorithm used for classification and
regression analysis. It is a tree-like structure in which an internal node represents a test on
an attribute, each branch represents the outcome of the test and each leaf node represents
a class label.
11 What are the rules in a decision tree?
Ans: A path from the root node to a leaf node represents classification rules
12 Explain the different types of nodes in the decision tree and how they are selected?
Ans:-We have the Root Node, Internal Nodes and Leaf Nodes in a decision tree. The Decision
Tree starts at the Root Node, which is the first node of the tree. The data set is split based
on the Root Node, and further nodes are then selected to split the already split data. This
process of splitting the data goes on until we get leaf nodes, which are nothing but the
classification labels. The Root Node and Internal Nodes are selected using the statistical
measure called Gain.
13 What do you mean by impurity in Decision Tree?
Ans:-We say a data set is pure or homogeneous if all of its class labels are the same and
impure or heterogeneous if the class labels are different. Entropy, Gini Index or
Classification Error can be used to measure the impurity of the data set.
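The three impurity measures named above can be sketched in a few lines. This is a minimal Python illustration using the usual textbook definitions, not any particular package's implementation; the function names and the two toy label lists are made up for the example.

```python
from collections import Counter
from math import log2


def class_proportions(labels):
    """Proportion of each class label in the data set."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]


def entropy(labels):
    """Entropy in bits: sum of -p * log2(p) over the classes."""
    return sum(-p * log2(p) for p in class_proportions(labels))


def gini_index(labels):
    """Gini index: 1 - sum of squared class proportions."""
    return 1 - sum(p * p for p in class_proportions(labels))


def classification_error(labels):
    """Classification error: 1 - proportion of the majority class."""
    return 1 - max(class_proportions(labels))


# Made-up label sets: one pure, one as impure as two classes can be.
pure = ["yes", "yes", "yes", "yes"]
mixed = ["yes", "yes", "no", "no"]

for name, labels in [("pure", pure), ("mixed", mixed)]:
    print(name, entropy(labels), gini_index(labels), classification_error(labels))
```

All three measures are zero for the pure set and reach their two-class maximum for the evenly mixed set.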
14 What is Pruning in Decision Tree?
Ans:-The process of removing sub-nodes which contribute little predictive power to the
decision tree model is called Pruning.
15 What is the advantage of Pruning?
Ans:-Pruning reduces the complexity of the model, which in turn reduces the overfitting
problem of the Decision Tree. There are two strategies in Pruning. Postpruning - discard
unreliable parts from the fully grown tree. Prepruning - stop growing a branch when the
information becomes unreliable. Postpruning is the preferred one.
16 What is the difference between Entropy and Information Gain?
Ans:-Entropy is a probabilistic measure of uncertainty or impurity whereas Information Gain
is the reduction of this uncertainty measure.
17 Explain the expression of Gain (of any column)?
Ans:-Gain for any column is calculated by subtracting the information (entropy) of the
dataset with respect to that variable from the information (entropy) of the entire dataset
i.e., Gain(Age) = Info(D) - Info(D wrt Age)
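The Gain expression can be worked through on a tiny example. The sketch below is in Python and assumes Info(D) is the entropy of the class labels and Info(D wrt Age) is the size-weighted entropy of the groups obtained by splitting on Age. The helper names and the six-row data set are made up for illustration.

```python
from collections import Counter, defaultdict
from math import log2


def info(labels):
    """Info(D): entropy of the class labels, in bits."""
    total = len(labels)
    return sum(-(c / total) * log2(c / total) for c in Counter(labels).values())


def info_wrt(values, labels):
    """Info(D wrt column): weighted entropy after splitting on the column."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[value].append(label)
    total = len(labels)
    return sum(len(group) / total * info(group) for group in groups.values())


def gain(values, labels):
    """Gain(column) = Info(D) - Info(D wrt column)."""
    return info(labels) - info_wrt(values, labels)


# Made-up toy data: an Age column and a yes/no class label.
age = ["young", "young", "middle", "middle", "old", "old"]
bought = ["no", "no", "yes", "yes", "yes", "no"]

print(round(info(bought), 3))           # entropy of the whole data set
print(round(info_wrt(age, bought), 3))  # entropy left after splitting on Age
print(round(gain(age, bought), 3))      # reduction in entropy
```

Here Info(D) is 1 bit, the split on Age leaves 1/3 of a bit (only the "old" group is still mixed), so Gain(Age) is about 0.667.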
18 What is the package required to implement Decision Tree in R?
Ans:-C50 and tree packages can be used to implement a decision tree algorithm in R.
19 What is a Random Forest?
Answer
Random Forest is an Ensemble Classifier. As opposed to building a single decision tree,
random forest builds many decision trees and combines the output of all the decision trees
to give a stable output.
How does Random Forest add randomness and build a better model?
Answer
Instead of searching for the most important feature while splitting a node, it searches for
the best feature among a random subset of features. This results in a wide diversity that
generally results in a better model. Additional randomness can be added by using random
thresholds for each feature rather than searching for the best possible thresholds (like a
normal decision tree does).
What are the pros of using Random Forest?
Answer
Random Forest is far less prone to overfitting than a single decision tree, delivers reliably
high accuracy, works very well on large data sets, can handle thousands of input variables
without variable deletion, outputs the importance of input variables, and handles outliers
and missing values well.
What is the limitation of Random Forest?
Answer
The main limitation of Random Forest is that a large number of trees can make the
algorithm too slow and ineffective for real-time predictions. In most real-world applications
the random forest algorithm is fast enough, but there can certainly be situations where
run-time performance is important and other approaches would be preferred.
What is a Neural Network?
Answer
A Neural Network is a supervised machine learning algorithm that is inspired by the
human nervous system and learns in a way loosely similar to how the human brain is trained.
It consists of an Input Layer, Hidden Layers & an Output Layer.
What are the various types of Neural Networks?
Answer
Artificial Neural Network, Recurrent Neural Networks, Convolutional Neural Networks,
Boltzmann Machine Networks, Hopfield Networks are examples of the Neural Networks.
There are a few other types as well.
What is the use of activation functions in neural network?
Answer
The activation function converts the input signal of a node in an ANN into an
output signal. That output signal is then used as an input to the next layer in the stack.
What are the different types of activation functions in neural network?
Answer
Sigmoid or Logistic, Tanh or Hyperbolic Tangent, and ReLU or Rectified Linear Unit are
examples of activation functions in a neural network.
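To make the role of the activation function concrete, the sketch below defines the three functions listed above and applies one to the weighted input of a single node. It is plain Python with the standard math module; node_output and its arguments are hypothetical names chosen for illustration.

```python
import math


def sigmoid(x):
    """Sigmoid (logistic): squashes any input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    """Hyperbolic tangent: squashes any input into (-1, 1)."""
    return math.tanh(x)


def relu(x):
    """Rectified Linear Unit: passes positives through, clips negatives to 0."""
    return max(0.0, x)


def node_output(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus bias, passed through the activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


for name, fn in [("sigmoid", sigmoid), ("tanh", tanh), ("relu", relu)]:
    print(name, [round(fn(x), 3) for x in (-2.0, 0.0, 2.0)])

# The output of this node would be fed as an input to the next layer.
print(node_output([1.0, 2.0], [0.5, 0.25], 0.0, sigmoid))
```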
What is the package name to implement a neural network in R?
Answer
neuralnet package can be used to implement a neural network in R
Which among the following prevents overfitting when we perform bagging?
A: The use of sampling with replacement as the sampling technique
B: The use of weak classifiers
C: The use of classification algorithms which are not prone to overfitting
D: The practice of validation performed on every classifier trained
Answer: B
Explanation: The presence of over-training (which leads to overfitting) is not generally a
problem with weak classifiers. For example, in decision trees with only one node (the root
node), there is no real scope for overfitting. This helps the classifier which combines the
outputs of weak classifiers in avoiding overfitting.
Sum of weights of the principal component in PCA analysis is
A) <1
B) 1
C) >1
D) None of the above
Answer: B
Which of the following testing is concerned with making decisions using data?
a)Probability
b)Hypothesis
c)Causal
d)None of the above
Answer: B
Explanation: The null hypothesis is assumed true and statistical evidence is required to
reject it in favor of a research or alternative hypothesis.
Which of the following combination is correct?
A: Continuous – euclidean distance
B: Continuous – correlation similarity
C: Binary – Jaccard's coefficient
D: All the above
Answer: D
Explanation: You should choose a distance/similarity that makes sense for your problem.
Which of the following is the correct use of cross-validation?
A: Selecting variables to include in a model
B: Comparing predictors
C: Selecting parameters in the prediction function
D: All of the Mentioned
Answer: D
Explanation: Cross-validation is also used to pick type of prediction function to be used.
Why does data cleaning play a vital role in analysis?
Answer
Cleaning data from multiple sources to transform it into a format that data analysts or
data scientists can work with is a cumbersome process: as the number of data sources
increases, the time taken to clean the data grows rapidly with the number of sources and
the volume of data they generate. Cleaning alone might take up to 80% of the time, making
it a critical part of the analysis task.
What are Recommender Systems?
Answer
A subclass of information filtering systems that are meant to predict the preferences or
ratings that a user would give to a product. Recommender systems are widely used in
movies, news, research articles, products, social tags, music, etc.
What is logistic regression? Or State an example when you have used logistic regression
recently?
Answer
Logistic Regression, often referred to as the logit model, is a technique to predict a binary
outcome from a linear combination of predictor variables. For example, suppose you want to
predict whether a particular political leader will win the election or not. In this case, the
outcome of prediction is binary i.e. 1 or 0 (Win/Lose). The predictor variables here would be
the amount of money spent on election campaigning of a particular candidate, the amount
of time spent in campaigning, etc.
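The election example can be sketched as follows. This Python snippet only shows how a fitted logit model turns a linear combination of predictors into a win probability. The coefficient values are hypothetical (not estimated from any data), and the function and variable names are made up for illustration.

```python
import math


def predict_win_probability(money_spent, hours_campaigning, coefficients):
    """Logistic function applied to a linear combination of the predictors."""
    intercept, b_money, b_hours = coefficients
    logit = intercept + b_money * money_spent + b_hours * hours_campaigning
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical coefficients: (intercept, per unit of money, per hour campaigning).
coefficients = (-4.0, 0.5, 0.02)

p = predict_win_probability(money_spent=8.0, hours_campaigning=50.0,
                            coefficients=coefficients)
outcome = 1 if p >= 0.5 else 0  # 1 = Win, 0 = Lose

print(round(p, 3), outcome)
```

In practice the coefficients would be estimated from historical data; only the prediction step is shown here.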
Answer :-
The mean() function can be used to compute the accuracy. Within the parentheses, the actual
labels have to be compared with the predicted labels, e.g. mean(actual == predicted)
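The same idea, taking the mean of a label-by-label comparison, looks like this in Python. The function name and the two label lists are made up for illustration.

```python
def accuracy(actual, predicted):
    """Share of observations whose predicted label equals the actual label."""
    matches = [a == p for a, p in zip(actual, predicted)]
    # True counts as 1 and False as 0, so the mean of the comparison
    # is the proportion of correct predictions.
    return sum(matches) / len(matches)


actual = ["spam", "ham", "spam", "ham", "ham"]
predicted = ["spam", "ham", "ham", "ham", "ham"]

print(accuracy(actual, predicted))  # 4 of the 5 labels match
```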
Functions to find row & column count in R?
Answer :-
dim() function or nrow() & ncol() can be used to find row & column count
What is the function to perform simple random sampling?
Answer :-
sample() is the function in R to employ Simple Random Sampling
What is Joint probability?
Answer :-
It is the probability of two events occurring at the same time. A classical example is the
probability of an email being spam with the word lottery in it. Here the events are the email
being spam and the email containing the word lottery.
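As a small worked example of the spam and lottery case, the joint probability can be estimated by counting the emails in which both events occur. The eight records below are made-up data for illustration only.

```python
# Each made-up record is (is_spam, contains_lottery).
emails = [
    (True, True), (True, False), (False, False), (True, True),
    (False, True), (False, False), (False, False), (True, False),
]

total = len(emails)

# Joint probability: both events occur in the same email.
p_joint = sum(1 for is_spam, has_lottery in emails if is_spam and has_lottery) / total

# Probabilities of each event on its own, for comparison.
p_spam = sum(1 for is_spam, _ in emails if is_spam) / total
p_lottery = sum(1 for _, has_lottery in emails if has_lottery) / total

print(p_joint, p_spam, p_lottery)
```

The joint probability can never exceed the probability of either event on its own.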
What is Probability?
Answer :-
Probability is given by Number of favourable outcomes/Total number of possible outcomes
Can we represent the output of a classifier having more than two levels using a confusion
matrix?
Answer :-
Yes. With k levels in the output variable, the confusion matrix is simply a k x k table of
actual against predicted labels. In R it can be built with table() or with the CrossTable()
function from the gmodels package.
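A confusion matrix is just a cross-tabulation of actual against predicted labels, so it can be tabulated for any number of levels. The Python sketch below builds one for three levels. The function name and the six observations are made up for illustration.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Cross-tabulate actual (rows) against predicted (columns) labels."""
    labels = sorted(set(actual) | set(predicted))
    counts = Counter(zip(actual, predicted))
    matrix = [[counts[(a, p)] for p in labels] for a in labels]
    return labels, matrix


# Made-up three-level output variable.
actual = ["bus", "rail", "car", "car", "bus", "rail"]
predicted = ["bus", "car", "car", "car", "rail", "rail"]

labels, matrix = confusion_matrix(actual, predicted)
print(labels)
for label, row in zip(labels, matrix):
    print(label, row)
```

The diagonal holds the correctly classified observations; everything off the diagonal is a misclassification.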
Difference between lapply & sapply function?
Answer :-
lapply always returns the output as a list, whereas sapply simplifies the output to a vector,
matrix or array where possible.
What is the R function to know the percentage of observations for the levels of a variable?
Answer :-
prop.table() applied on top of the table() function, i.e., prop.table(table()), is the R
function. It can be applied to any variable, but it makes most sense on a factor variable.
What is the R function to know the number of observations for the levels of a variable?
Answer :-
table() is the R function. It can be applied to any variable, but it makes most sense on a
factor variable.
Function in R to employ KNN?
Answer :-
knn() can be used from the class package
How do we choose the value of K in KNN algorithm?
Answer :-
The value of K can be selected using the rule of thumb sqrt(no. of obs/2), the kselection
package, a scree plot, or k-fold cross-validation
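Why is KNN called a Lazy Algorithm?
Answer :-
There is no or minimal training phase, so "training" is very fast: the model simply stores
the training data. All the work is deferred to the testing phase, where the stored training
data is used to find the nearest neighbours of each new observation.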
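The laziness is easy to see in a bare-bones implementation. In the Python sketch below, fit() does nothing except store the data, and all distance computation happens inside predict(). The class name, the Euclidean distance and the toy points are assumptions made for illustration; this is not the knn() function from R's class package.

```python
from collections import Counter
from math import dist


class LazyKNN:
    def __init__(self, k=3):
        self.k = k

    def fit(self, points, labels):
        # "Training" only stores the data; no model parameters are estimated.
        self.points = list(points)
        self.labels = list(labels)
        return self

    def predict(self, query):
        # All the work happens here: measure the distance to every stored
        # point, keep the k nearest, and take a majority vote on their labels.
        nearest = sorted(
            zip(self.points, self.labels),
            key=lambda pair: dist(pair[0], query),
        )[: self.k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]


# Made-up two-dimensional training data with two classes.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = ["a", "a", "a", "b", "b", "b"]

model = LazyKNN(k=3).fit(points, labels)
print(model.predict((2, 2)), model.predict((9, 9)))
```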
Why is KNN called a non-parametric algorithm?
Answer :-
KNN makes no assumptions about the distribution of the underlying data (unlike other
algorithms, e.g. Linear Regression)
What is the use of set.seed() function ?
Answer :-
The set.seed() function is used to reproduce the same results when the code is re-run. Any
number can be given within the parentheses
How to interpret clustering output?
Answer :-
After computing the optimal clusters, an aggregate measure such as the mean has to be
computed for every variable within each cluster, and the resulting values then have to be
compared across the clusters
How do we decide upon the number of clusters in hierarchical clustering?
Answer :-
In Hierarchical Clustering the number of clusters is decided only after looking at the
dendrogram.
What are linkages in hierarchical clustering?
Answer :-
Linkage is the criterion by which the distance between two clusters is computed. Single,
Complete and Average are a few examples of linkages
Single - The distance between two clusters is defined as the shortest distance between a
point in one cluster and a point in the other.
Complete - The distance between two clusters is defined as the longest distance between a
point in one cluster and a point in the other.
Average - The distance between two clusters is defined as the average distance between
each point in one cluster and every point in the other cluster.
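The three linkages differ only in how they summarise the pairwise distances between the two clusters, as the Python sketch below shows. Euclidean distance is assumed, and the function names and the two small clusters are made up for illustration.

```python
from itertools import product
from math import dist


def pairwise_distances(cluster_a, cluster_b):
    """Euclidean distance between every point in A and every point in B."""
    return [dist(p, q) for p, q in product(cluster_a, cluster_b)]


def single_linkage(cluster_a, cluster_b):
    """Shortest distance between a point in A and a point in B."""
    return min(pairwise_distances(cluster_a, cluster_b))


def complete_linkage(cluster_a, cluster_b):
    """Longest distance between a point in A and a point in B."""
    return max(pairwise_distances(cluster_a, cluster_b))


def average_linkage(cluster_a, cluster_b):
    """Average of all the pairwise distances between A and B."""
    distances = pairwise_distances(cluster_a, cluster_b)
    return sum(distances) / len(distances)


# Made-up clusters of two points each.
cluster_a = [(0.0, 0.0), (0.0, 1.0)]
cluster_b = [(3.0, 0.0), (4.0, 0.0)]

print(single_linkage(cluster_a, cluster_b))
print(complete_linkage(cluster_a, cluster_b))
print(average_linkage(cluster_a, cluster_b))
```

For any pair of clusters the average linkage distance lies between the single and the complete linkage distances.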
Packages to read excel files in R?
Answer :-
readxl or xlsx packages can be used to read excel files in R
What is the str() command and why is it required to run it?
Answer :-
The str() command gives the dimensions of your data frame. In addition, it gives the class
of the dataset & the class of every variable
What does the summary() command give?
Answer :-
The summary() command gives the distribution (minimum, quartiles, median, mean and maximum)
for numerical variables and the count of observations in each level for factor variables
What is the range of a variable when the (x - min(X))/(max(X) - min(X)) normalization
technique is employed?
Answer :-
0 to 1 is the range for this normalization technique
What is the range of Z transformed variable?
Answer :-
Theoretically it lies between -infinity and +infinity, but for roughly normally distributed
data almost all values fall between -3 and +3
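Both rescalings can be checked on a handful of numbers. The Python sketch below assumes the min-max formula (x - min(X))/(max(X) - min(X)) and a Z transformation that subtracts the mean and divides by the sample standard deviation. The function names and the four values are made up for illustration, and both functions assume the values are not all equal.

```python
from statistics import mean, stdev


def min_max_normalize(values):
    """Rescale to [0, 1] using (x - min(X)) / (max(X) - min(X))."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]


def z_transform(values):
    """Centre on the mean and divide by the sample standard deviation."""
    mu, sigma = mean(values), stdev(values)
    return [(x - mu) / sigma for x in values]


# Made-up values of a single variable.
incomes = [20.0, 35.0, 50.0, 80.0]

scaled = min_max_normalize(incomes)
z_scores = z_transform(incomes)

print(scaled)                             # smallest value maps to 0, largest to 1
print([round(z, 3) for z in z_scores])    # values centred around 0
```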
Is normalization of data required before applying clustering?
Answer :-
It is better to apply clustering on normalized data, because distance measures are sensitive
to the scale of the variables and you will get different results with and without normalization
Example of clustering?
Answer :-
Using variables like income, education, profession, age, number of children, etc. you come
up with different clusters, and each cluster has people with similar socio-economic criteria
In which domains can we employ clustering?
Answer :-
None of your data science topics are domain specific. They can be employed in any domain,
provided data is available.
When can you say that resultant clusters are good?
Answer :-
When the clusters are as heterogeneous as possible with respect to one another and the
observations within each cluster are as homogeneous as possible.
Why is hierarchical clustering called Agglomerative clustering?
Answer :-
It is because of the bottom-up approach, where initially each observation is considered to be
a single cluster and, based on the distance measure, individual clusters are gradually paired
and finally merged into one
Examples of Supervised Machine Learning
Answer :-
KNN, Naive Bayes, SVM, Decision Tree, Random Forest, Neural Network
Examples of Unsupervised Machine Learning
Answer :-
Segmentation, PCA, SVD, Market Basket Analysis, Recommender Systems
What is Classification Modeling?
Answer :-
Classification Models are employed when the observations have to be classified into
categories rather than predicted as numeric values.
Examples are Cancerous and Non-cancerous tumor (2 categories), and Bus, Rail, Car, Carpool
(>2 categories)
What is Unsupervised Machine Learning?
Answer :-
In this category of Machine Learning, there won’t be any output variable to be either
predicted or classified. Instead the algorithm understands the patterns in the data.
Examples: Segmentation, PCA, SVD, Market Basket Analysis, Recommender Systems.
What is Supervised Machine Learning?
Answer :-
Supervised Machine Learning is employed for problem statements wherein an output
variable of interest has to be either classified (nominal) or predicted (numeric).
Examples: KNN, Naive Bayes, SVM, Decision Tree, Random Forest, Neural Network
What is Machine Learning?
Answer :-
Machine learning is the science of getting computers to act without being explicitly
programmed. Machine learning has given us self-driving cars, practical speech recognition,
effective web search, and a vastly improved understanding of the human genome. It is so
widespread that we unknowingly use it many times in our daily life.
Differentiate between univariate, bivariate and multivariate analysis.
Answer :-
These are descriptive statistical analysis techniques which can be differentiated based on
the number of variables involved at a given point of time. For example, the pie charts of
sales based on territory involve only one variable and can be referred to as univariate
analysis.
If the analysis attempts to understand the relationship between 2 variables at a time, as in
a scatterplot, then it is referred to as bivariate analysis. For example, analysing the volume
of sales against spending can be considered an example of bivariate analysis.
Analysis that deals with the study of more than two variables to understand the effect of
variables on the responses is referred to as multivariate analysis.