Data Science Interview Questions
Variance:
Variance is the error introduced in your model by using an overly complex machine learning algorithm: the model also learns noise from the training data set and therefore performs poorly on the test data set. High variance leads to high sensitivity to the training data and to overfitting.
Normally, as you increase the complexity of your model, you will see a reduction in error due to lower bias in the model. However, this only happens up to a certain point. As you continue to make your model more complex, you end up overfitting it, and hence your model will start suffering from high variance.
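A minimal sketch of this trade-off, assuming scikit-learn and a hypothetical synthetic data set: as tree depth (model complexity) grows, training accuracy keeps improving while validation accuracy eventually stalls or drops, which is the high-variance regime.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic data set, used purely for illustration
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Vary model complexity (tree depth) and record train/validation accuracy
depths = np.arange(1, 16)
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    param_name="max_depth", param_range=depths, cv=5)

for d, tr, va in zip(depths, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    # Training accuracy keeps rising while validation accuracy flattens or
    # drops: the point where the two diverge signals high variance.
    print(f"max_depth={d:2d}  train={tr:.3f}  validation={va:.3f}")
```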
A data set used for performance evaluation is called a test data set. It should contain the correct (observed) labels along with the predicted labels.
If the performance of a binary classifier were perfect, the predicted labels would be exactly the same as the observed labels. In real-world scenarios, the predicted labels usually match only part of the observed labels.
A binary classifier predicts each data instance of a test data set as either positive or negative. This produces four possible outcomes:
1. True positive (TP): correct positive prediction
2. False positive (FP): incorrect positive prediction
3. True negative (TN): correct negative prediction
4. False negative (FN): incorrect negative prediction
Basic measures derived from the confusion matrix (with P = TP + FN actual positives and N = TN + FP actual negatives):
1. Error Rate = (FP + FN) / (P + N)
2. Accuracy = (TP + TN) / (P + N)
3. Sensitivity (Recall or True positive rate) = TP / P
4. Specificity (True negative rate) = TN / N
5. Precision (Positive predictive value) = TP / (TP + FP)
6. F-Score (weighted harmonic mean of precision and recall) = (1 + b²)(Precision · Recall) / (b² · Precision + Recall), where b is commonly 0.5, 1, or 2.
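As an illustration, a minimal sketch (plain NumPy, hypothetical label arrays) that derives these measures from the four confusion-matrix counts:

```python
import numpy as np

# Hypothetical observed vs. predicted labels, for illustration only
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
tn = np.sum((y_pred == 0) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))
p, n = tp + fn, tn + fp  # actual positives and actual negatives

error_rate  = (fp + fn) / (p + n)
accuracy    = (tp + tn) / (p + n)
recall      = tp / p            # sensitivity / true positive rate
specificity = tn / n            # true negative rate
precision   = tp / (tp + fp)

b = 1  # b = 1 gives the familiar F1 score
f_score = (1 + b**2) * precision * recall / (b**2 * precision + recall)
print(error_rate, accuracy, recall, specificity, precision, f_score)
```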
Boosting
Boosting is a sequential ensemble technique: if an observation was classified incorrectly, it increases the weight of that observation, and vice versa. Boosting in general decreases the bias error and builds strong predictive models. However, boosted models may overfit the training data.
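A minimal sketch of the idea, using scikit-learn's AdaBoostClassifier as one common boosting implementation and a hypothetical synthetic data set:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data set, for illustration only
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# AdaBoost re-weights misclassified observations at every boosting round
boost = AdaBoostClassifier(n_estimators=100, random_state=0)
boost.fit(X_train, y_train)
print("test accuracy:", boost.score(X_test, y_test))
```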
15. What is Random Forest? How does it work?
Random forest is a versatile machine learning method capable of performing both regression and classification tasks. It can also be used for dimensionality reduction, and it handles missing values and outlier values. It is a type of ensemble learning method, where a group of weak models combine to form a powerful model.
In Random Forest, we grow multiple trees as opposed to a single tree. To classify a new object based on its attributes, each tree gives a classification. The forest chooses the classification having the most votes (over all the trees in the forest), and in the case of regression, it takes the average of the outputs of the different trees.
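A minimal classification sketch, assuming scikit-learn and its built-in iris data set; each tree votes and the forest returns the majority class:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Standard iris data set, used purely for illustration
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each of the 200 trees votes; the forest returns the majority class
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))
```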
16. What cross-validation technique would you use on a time series data set?
Instead of using k-fold cross-validation, you should be aware of the fact that a time series is not randomly distributed data; it is inherently ordered chronologically.
In the case of time series data, you should use techniques like forward chaining, where you build the model on past data and then test it on the data that follows:
fold 1: training [1], test [2]
fold 2: training [1 2], test [3]
fold 3: training [1 2 3], test [4]
fold 4: training [1 2 3 4], test [5]
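scikit-learn's TimeSeriesSplit implements this forward-chaining scheme; a minimal sketch with a hypothetical ten-step series:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical ordered observations, one row per time step
X = np.arange(10).reshape(-1, 1)

# Each split trains on the past and tests on the step(s) that follow
tscv = TimeSeriesSplit(n_splits=4)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X), start=1):
    print(f"fold {fold}: training {train_idx.tolist()}, test {test_idx.tolist()}")
```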
17. What is logistic regression? Or state an example when you have used logistic regression recently.
Logistic Regression, often referred to as the logit model, is a technique to predict a binary outcome from a linear combination of predictor variables. For example, suppose you want to predict whether a particular political leader will win the election or not. In this case, the outcome of the prediction is binary, i.e. 1 (win) or 0 (lose). The predictor variables here would be the amount of money spent on election campaigning for a particular candidate, the amount of time spent campaigning, etc.
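A minimal sketch of the election example, assuming scikit-learn and a hypothetical synthetic data set standing in for the predictor variables:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical campaign data: each row could hold features such as money
# and time spent campaigning; y marks win (1) or lose (0)
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression()
model.fit(X_train, y_train)
print("predicted win probability:", model.predict_proba(X_test[:1])[0, 1])
```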
Data is usually distributed in different ways, with a bias to the left or to the right, or it can all be jumbled up. However, there are chances that data is distributed around a central value without any bias to the left or right, reaching a normal distribution in the form of a bell-shaped curve. The random variables are then distributed in the form of a symmetrical bell-shaped curve.
A Box-Cox transformation is a way to transform non-normal dependent variables into a normal shape. Normality is an important assumption for many statistical techniques; if your data isn't normal, applying a Box-Cox transformation means that you are able to run a broader number of tests. The Box-Cox transformation is named after statisticians George Box and Sir David Roxbee Cox, who collaborated on a 1964 paper in which they developed the technique.
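A minimal sketch using scipy.stats.boxcox on hypothetical right-skewed data (the transformation requires strictly positive values):

```python
import numpy as np
from scipy import stats

# Hypothetical right-skewed (log-normal) data, for illustration only
rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

# boxcox estimates lambda by maximum likelihood; the transformed values
# are closer to a normal distribution than the original ones
transformed, fitted_lambda = stats.boxcox(skewed)
print("estimated lambda:", fitted_lambda)
print("skewness before:", stats.skew(skewed), "after:", stats.skew(transformed))
```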
20. How will you define the number of clusters in a clustering algorithm?
Though the clustering algorithm is not specified, this question will mostly be asked in reference to K-Means clustering, where "K" defines the number of clusters. For example, a scatter plot of the data might show three distinct groups.
The within-cluster sum of squares (WSS) is generally used to measure the homogeneity within a cluster. If you plot WSS for a range of numbers of clusters, you get a curve generally known as the elbow curve.
The point after which you no longer see any meaningful decrease in WSS (number of clusters = 6 in the example) is known as the bending point and is taken as K in K-Means. This is the widely used approach, but some data scientists also use hierarchical clustering first to create dendrograms and identify the distinct groups from there.
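A minimal sketch of the elbow approach, assuming scikit-learn and a hypothetical blob data set; KMeans exposes WSS as its inertia_ attribute:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical data with three underlying groups
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# inertia_ is the within-cluster sum of squares (WSS) for a given K
for k in range(1, 9):
    wss = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(f"K={k}: WSS={wss:.1f}")
# Plotting WSS against K and looking for the bend gives the elbow curve.
```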
Recurrent networks, on the other hand, take as their input not just the current input example they see, but also what they have perceived previously in time. At each step, the input example of the current moment is combined with a context unit holding the output of the previous moment. The decision a recurrent neural network reached at time t-1 affects the decision that it will reach one moment later at time t. So recurrent networks have two sources of input, the present and the recent past, which combine to determine how they respond to new data, much as we do in life.
The error they generate will return via back propagation and be used to adjust their weights until
error can’t go any lower. Remember, the purpose of recurrent nets is to accurately classify
sequential input. We rely on the back propagation of error and gradient descent to do so.
Back propagation in feed forward networks moves backward from the final error through the
outputs, weights and inputs of each hidden layer, assigning those weights responsibility for a
portion of the error by calculating their partial derivatives — ∂E/∂w, or the relationship between
their rates of change. Those derivatives are then used by our learning rule, gradient descent, to
adjust the weights up or down, whichever direction decreases error.
Recurrent networks rely on an extension of back propagation called back propagation through time,
or BPTT. Time, in this case, is simply expressed by a well-defined, ordered series of calculations
linking one time step to the next, which is all back propagation needs to work.
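A minimal sketch of the recurrence in plain NumPy (hypothetical sizes and random weights); the hidden state plays the role of the context unit, and BPTT back propagates the error gradients through this unrolled loop:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny recurrent cell: hypothetical sizes chosen only for illustration
n_in, n_hidden = 3, 4
W_xh = rng.normal(scale=0.1, size=(n_hidden, n_in))      # input -> hidden
W_hh = rng.normal(scale=0.1, size=(n_hidden, n_hidden))  # previous hidden -> hidden

# A short sequence of 5 input vectors (the ordered series of calculations)
sequence = rng.normal(size=(5, n_in))

h = np.zeros(n_hidden)  # context unit: output of the previous moment
for t, x_t in enumerate(sequence):
    # Two sources of input: the present (x_t) and the recent past (h)
    h = np.tanh(W_xh @ x_t + W_hh @ h)
    print(f"t={t}: hidden state {np.round(h, 3)}")
# BPTT unrolls this loop and back propagates the error through every time step.
```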
23. What is the difference between machine learning and deep learning?
Machine learning:
Machine learning is a field of computer science that gives computers the ability to learn without
being explicitly programmed. Machine learning can be categorised into the following three categories:
1. Supervised machine learning,
2. Unsupervised machine learning,
3. Reinforcement learning
Deep learning:
Deep Learning is a subfield of machine learning concerned with algorithms, called artificial neural networks, that are inspired by the structure and function of the brain.
Reinforcement Learning is learning what to do and how to map situations to actions. The goal is to maximise a numerical reward signal. The learner is not told which action to take, but instead must discover which actions yield the maximum reward. Reinforcement learning is inspired by the way human beings learn; it is based on a reward/penalty mechanism.
30. If you have 4 GB of RAM in your machine and you want to train your model on a 10 GB data set, how would you go about this problem? Have you ever faced this kind of problem in your machine learning/data science experience so far?
First of all, you have to ask which ML model you want to train.
For Neural networks: batch training with NumPy memory-mapped arrays will work.
Steps:
1. Load the whole data set as a NumPy memory map. A memory map only maps the complete data set on disk; it doesn't load the complete data set into memory.
2. Pass indices to the NumPy array to get the required slice of data.
3. Pass this slice to the neural network.
4. Keep the batch size small.
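A minimal sketch of this approach, assuming hypothetical data.npy and labels.npy files saved earlier with np.save; np.load with mmap_mode memory-maps the file instead of reading it all into RAM:

```python
import numpy as np

# Hypothetical 10 GB feature matrix and labels saved earlier with np.save;
# mmap_mode="r" maps the files lazily instead of loading them into memory
X = np.load("data.npy", mmap_mode="r")
y = np.load("labels.npy", mmap_mode="r")

batch_size = 256
for start in range(0, X.shape[0], batch_size):
    # Indexing the memory map reads only this slice from disk
    X_batch = np.asarray(X[start:start + batch_size])
    y_batch = np.asarray(y[start:start + batch_size])
    # e.g. pass the small batch to the network: model.train_on_batch(X_batch, y_batch)
```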
For SVM: partial fit will work.
Steps:
1. Divide the one big data set into small data sets.
2. Use a partial fit method, which requires only a subset of the complete data set at a time.
3. Repeat step 2 for the other subsets.
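scikit-learn's SVC has no partial_fit, so a common substitute (an assumption here, not necessarily the author's exact setup) is SGDClassifier with hinge loss, which trains a linear SVM incrementally on one chunk at a time:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)

def chunks(n_chunks=40, chunk_size=250, n_features=20):
    """Hypothetical generator standing in for the small data sets of step 1."""
    for _ in range(n_chunks):
        X = rng.normal(size=(chunk_size, n_features))
        y = (X[:, 0] + X[:, 1] > 0).astype(int)
        yield X, y

# loss="hinge" makes SGDClassifier train a linear SVM incrementally
svm = SGDClassifier(loss="hinge")
classes = np.array([0, 1])  # every class must be declared on the first call

for X_chunk, y_chunk in chunks():
    svm.partial_fit(X_chunk, y_chunk, classes=classes)
```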
What is Naive?
The algorithm is 'naive' because it makes an assumption that may or may not turn out to be correct: it assumes that the features are conditionally independent of each other given the class.
33. Why do we generally use the Softmax non-linearity function as the last operation in a network?
It is because it takes in a vector of real numbers and returns a probability distribution. Its definition is as follows. Let x be a vector of real numbers (positive, negative, whatever; there are no constraints). Then the i'th component of Softmax(x) is
Softmax(x)_i = exp(x_i) / Σ_j exp(x_j)
It should be clear that the output is a probability distribution: each element is non-negative and the sum over all components is 1.
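A minimal NumPy sketch of that definition:

```python
import numpy as np

def softmax(x):
    """Softmax of a vector of arbitrary real numbers."""
    # Subtracting the max is a standard numerical-stability trick;
    # it does not change the result because the shift cancels out.
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.array([2.0, -1.0, 0.5])  # hypothetical network outputs (logits)
p = softmax(x)
print(p, p.sum())               # non-negative components summing to 1
```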