
Important Tips

Data science interview questions can cover statistics, math, data visualization, analytics, software engineering, basic ML concepts, ML models etc. Questions can also range from fixed-answer questions to open-ended questions where multiple solutions are possible.
Please DO NOT TRY TO MEMORIZE these question answers. Instead, use this notebook to test your knowledge and work on your weak areas accordingly.
Try to understand what the interviewer is trying to learn from each question and explain it in your own words.
The best way to learn anything new is to read about it -> understand the concepts -> try it yourself -> if possible, publish your learning and collaborate with other learners.
I will keep updating this notebook as and when I come across interesting questions.

Suppose you had bank transaction data and wanted to separate out likely fraudulent transactions. How would you approach it? Why might accuracy be a bad metric for evaluating success?
In Machine Learning, problems like fraud detection are usually framed as
classification problems. In order to solve this problem we may use different
features like amount, merchant, location, time etc associated with each
transaction.
One of the biggest challenges with fraud detection is that the majority of transactions are not fraud, so we have imbalanced data!
The first step will be to do EDA to understand our data and the severity of the class imbalance.
In order to handle the imbalanced data problem we can use one of the following methods:
Oversampling — SMOTE (Synthetic Minority Over-sampling Technique)
Undersampling — One simple way of undersampling is randomly selecting a handful of samples from the class that is overrepresented.

Combined Class Methods — Use SMOTE together with edited nearest-
neighbours (ENN). Here, ENN is used as the cleaning method after
SMOTE over-sampling to obtain a cleaner space.
Developed by Wilson (1972), the ENN method works by first finding the K-nearest neighbors of each observation, then checking whether the majority class among the observation's K-nearest neighbors is the same as the observation's class or not.
If the majority class of the observation's K-nearest neighbors and the observation's class are different, then the observation and its K-nearest neighbors are deleted from the dataset. By default, the number of nearest neighbors used in ENN is K=3.
Because ENN removes the observation and its K-nearest neighbors, rather than just the observation and its 1-nearest neighbor when their classes differ, ENN can be expected to give more in-depth data cleaning.
Test model performance for each of the above techniques and choose the best performing model (a resampling sketch is shown below).
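A minimal resampling sketch, assuming the imbalanced-learn package is available and that X, y are hypothetical feature and label arrays:

# Sketch: comparing resampling strategies for an imbalanced fraud dataset.
from collections import Counter
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.combine import SMOTEENN

def resample_all(X, y):
    samplers = {
        "smote": SMOTE(random_state=42),                       # oversample the minority class
        "random_under": RandomUnderSampler(random_state=42),   # drop majority-class samples
        "smote_enn": SMOTEENN(random_state=42),                # SMOTE followed by ENN cleaning
    }
    for name, sampler in samplers.items():
        X_res, y_res = sampler.fit_resample(X, y)
        print(name, Counter(y_res))    # inspect the new class balance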

Why might accuracy be a bad metric for evaluating success?
In the case of imbalanced data the accuracy metric is not useful. Accuracy only tells us the fraction of all predictions that were correct, so a model that labels every transaction as genuine can still score very high. But here we are more interested in the fraud transactions.
We don't mind declaring a few good transactions as fraud, but failing to identify a fraud transaction is not acceptable. In such cases catching the positive (fraud) class is the priority, so metrics such as recall, precision and F1 score make more sense than accuracy.

Explain the inner working of linear regression
https://satishgunjal.com/univariate_lr/

What are the assumptions for linear regression?
Linear regression assumptions are as below:
Data should have a linear relationship between X and Y (actually the mean of Y).
The residuals (errors) should be normally distributed.
Observations should be independent of each other.
No or little multicollinearity, i.e. the predictor variables should not be highly correlated with each other.
The additivity assumption means the effect of a change in one feature on the response variable does not depend on the values of the other features.
Homoscedasticity — the residuals should have constant (equal) variance across all values of the predictors.

Explain the inner working of logistic regression
https://satishgunjal.com/binary_lr/

How can AI be used in spam email detection?
First, we collect spam and ham email data.
Then we find the statistical relations between words to create a feature matrix to train the classification models.
A trained classification model can then be used to determine whether a piece of text belongs to a certain class.
We can use algorithms like Naïve Bayes, RNNs and transformers for spam detection (a minimal sketch is shown below).
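A minimal sketch of such a classifier using scikit-learn, where texts and labels are hypothetical raw emails and spam/ham tags:

# Sketch: bag-of-words features + Naive Bayes for spam vs. ham classification.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["win a free prize now", "meeting at 10am tomorrow"]   # hypothetical emails
labels = ["spam", "ham"]                                        # hypothetical labels

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)                      # learn word statistics per class
print(model.predict(["free prize meeting"]))  # classify new text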

What are the advantages and disadvantages of neural networks?
Here are some advantages of Neural Networks

Storing information on the entire network: Information is stored across the entire network, not in a database as in traditional programming. The disappearance of a few pieces of information in one place does not prevent the network from functioning.
The ability to work with inadequate knowledge: After training, an ANN may produce output even with incomplete information. The loss of performance here depends on the importance of the missing information.
It has fault tolerance: Corruption of one or more cells of ANN does not
prevent it from generating output. This feature makes the networks fault-
tolerant.
Having a distributed memory: For an ANN to learn, it is necessary to determine the examples and to teach the network the desired output by showing it these examples. The network's progress is directly proportional to the selected instances, and if the event cannot be shown to the network in all its aspects, the network can produce incorrect output.
Gradual corruption: A network slows down over time and undergoes relative degradation. A problem in the network does not corrupt it immediately.
Ability to learn: Artificial neural networks learn from events and make decisions by generalizing from similar events.
Parallel processing ability: Artificial neural networks have the numerical strength to perform more than one job at the same time.

Disadvantages of Neural Networks

Hardware dependence: Artificial neural networks require processors with parallel processing power, in accordance with their structure. For this reason, they depend on suitable hardware being available.
Unexplained functioning of the network: This is the most important problem of ANNs. When an ANN produces a solution, it does not give a clue as to why and how. This reduces trust in the network.
Assurance of proper network structure: There is no specific rule for determining the structure of artificial neural networks. The appropriate network structure is achieved through experience and trial and error.
The difficulty of showing the problem to the network: ANNs can only work with numerical information. Problems have to be translated into numerical values before being introduced to the ANN. The representation mechanism chosen here directly influences the performance of the network, and this depends on the user's ability.
The duration of training is unknown: Training is considered complete when the error on the sample falls below a certain value, but this value does not guarantee optimum results.

What is the difference between bias and variance?
Bias comes from a model underfitting the data, whereas variance is the result of a model overfitting the data.
Underfitting models have high error on the training as well as the test set. This behavior is called 'high bias'.
Consider the below example of bias (underfitting), where we are trying to fit a linear function to nonlinear data.
Overfitting models have low error on the training set but high error on the test set. This behavior is called 'high variance'.
Consider the below example of variance (overfitting), where a complicated function creates lots of unnecessary curves and angles that are not related to the data.
Low bias (low underfitting) ML algorithms: Decision Tree, k-NN, SVM
High bias (high underfitting) ML algorithms: Linear regression, Logistic regression
High variance (high overfitting) ML algorithms: Polynomial regression
Reference: https://satishgunjal.com/underfitting_overfitting/

What is bias-variance tradeoff


As we increase the complexity of the model, error will reduce due to lower bias in the model. However, this only happens up to a particular point. If we continue to make our model more complex, the model will overfit and lead to high variance.
The goal of any supervised ML algorithm is to have low bias and low variance to achieve good prediction performance. This is referred to as the bias-variance tradeoff. We can achieve the bias-variance tradeoff by selecting the optimum model complexity.

We can also use hyperparameters to adjust model complexity; a few examples are below (see the cross-validation sketch after this list).

The K-NN algorithm has low bias (underfitting) and high variance (overfitting); a tradeoff can be achieved by increasing the value of 'K'.
A higher value of 'K' means a higher number of neighbours, which in turn increases the bias of the model.
The SVM algorithm has low bias (underfitting) and high variance (overfitting); a tradeoff can be achieved by changing the 'C' parameter.
The C parameter tells the SVM optimization how much you want to avoid misclassifying each training example.
For large values of C, the optimization will choose a smaller-margin hyperplane if that hyperplane does a better job of getting all the training points classified correctly.
Conversely, a very small value of C will cause the optimizer to look for a larger-margin separating hyperplane, even if that hyperplane misclassifies more points.
For very tiny values of C, you should get misclassified examples, often even if your training data is linearly separable.
Decision trees have low bias (underfitting) and high variance (overfitting); the bias-variance tradeoff can be achieved by changing the tree depth.
If the tree is shallow then we're not checking a lot of conditions/constraints, i.e. the logic is simple or less complex, which automatically reduces over-fitting. This introduces more bias compared to deeper trees, where we overfit the data. By deliberately not evaluating more conditions we are making an assumption (introducing bias) while creating the tree.
Linear regression has low variance (overfitting) and high bias (underfitting); the bias-variance tradeoff can be achieved by increasing the number of features or by using another regression technique that can fit the data better.
If the data is not linearly separable, the linear regression algorithm will result in low variance and high bias.
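A minimal sketch of tuning one such hyperparameter (tree depth) with cross-validation, on stand-in data:

# Sketch: sweeping decision-tree depth to trade bias against variance.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)   # stand-in data

for depth in [1, 3, 5, 10, None]:            # shallow = more bias, deep = more variance
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    scores = cross_val_score(tree, X, y, cv=5)
    print(depth, round(np.mean(scores), 3))   # pick the depth with the best CV score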

What is more important: model accuracy or model performance?
Short answer: model accuracy matters the most! Inaccurate information is not useful.
Model performance can be improved by increasing the compute resources.
Model accuracy and performance can be subjective to the problem at hand.
For example, in the analysis of medical images to determine if there is a disease (such as cancer), accuracy is extremely critical, even if the model takes minutes or hours to make a prediction.
Some applications require real time performance, even if this comes at a cost of accuracy. For example, imagine a machine that views a fast conveyor belt carrying tomatoes, where it must separate the green ones from the red ones. Though an occasional error is undesired, the success of this machine is determined more by its ability to keep up with the belt's throughput.
A more common example is face detection for recreational applications.
People would expect a fast response from the app, though the occasional
missed face would not render it useless.
Reference: https://www.quora.com/Which-is-more-important-to-you-model-
accuracy-or-model-performance

What is the difference between machine learning and deep learning?
Deep Learning outperforms traditional ML techniques if the data size is large, but with a small data size, traditional Machine Learning algorithms are preferable.
Deep Learning really shines when it comes to complex problems such as image
classification, natural language processing, and speech recognition. Few
important differences are as below,
| Machine Learning | Deep Learning |
| --- | --- |
| Machine learning uses algorithms to parse data, learn from that data, and make informed decisions based on what it has learned. | Deep learning structures algorithms in layers to create an "artificial neural network" that can learn and make intelligent decisions on its own. |
| Using handcrafted rules and feature engineering, ML algorithms can work well with small data, but their performance plateaus once data increases. | Deep learning algorithms need large data to perform well; deep learning performance increases as data increases. |
| Traditional ML algorithms can work on less computing power. | DL algorithms need high compute. There is also special purpose hardware for DL such as GPUs and TPUs. |
| In ML, domain experts/data scientists need to do feature engineering in order to enable the model to learn all data patterns. | One advantage of DL is that it learns high level features from the data; no external feature engineering is required. |
| ML models take less time to train. | DL models take more time to train. |
| ML models are easy to interpret compared to DL models. | DL models are black boxes and it is very difficult to interpret their results. |

Explain standard deviation and variance

[Figure: normal distribution curve with bands at ±1, ±2 and ±3 standard deviations; about 68% of values fall within ±1 standard deviation of the mean]
Standard deviation is a measure of how spread out numbers are. Formula: σ = square root of the variance.
Variance is defined as “The average of the squared differences from the
Mean.”
Using the Standard Deviation we have a "standard" way of knowing what is
normal, and what is extra large or extra small.
We can expect about 68% of values to be within plus-or-minus 1 standard
deviation.
A low standard deviation indicates that the data points tend to be very close
to the mean; a high standard deviation indicates that the data points are
spread out over a large range of values
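In symbols, for N values x1, …, xN with mean μ (population form; the sample variance divides by N − 1 instead):
Variance σ² = (1/N) × Σ (xi − μ)², and standard deviation σ = √(σ²)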
Ref. https://www.mathsisfun.com/data/standard-deviation.html#Top

Explain confusion matrix

The confusion matrix is one of the most powerful tools for predictive analysis
in machine learning.
A confusion matrix gives you information about how your machine classifier
has performed, pitting properly classified examples against misclassified
examples.
Confusion matrices are used to visualize important predictive analytics like
recall, specificity, accuracy, and precision.
Confusion matrices are useful because they give direct comparisons of values like True Positives, False Positives, True Negatives and False Negatives. In contrast, other machine learning classification metrics like "Accuracy" give less detailed information, as Accuracy is simply the number of correct predictions divided by the total number of predictions.
All estimation parameters of the confusion matrix are based on 4 basic
inputs namely True Positive, False Positive, True Negative and False Negative.
Confusion matrices have two types of errors: Type I (False Positive) and Type
II (False Negative). False Positive contains one negative word (False) so it’s a
Type I error. False Negative has two negative words (False + Negative) so it’s
a Type II error.
From our confusion matrix, we can calculate five different metrics measuring
the validity of our model.
ACCURACY

Accuracy is the ratio of correctly identified subjects in a pool of subjects.

Accuracy = (all correct / all) = (TP+TN)/(TP+FP+FN+TN)

Accuracy answers the question: How many patients did we correctly


identify out of all patients?
PRECISION

Precision is the ratio of correctly identified +ve subjects by test, against


all +ve subjects identified by test.

Precision = (true positives / predicted positives) = TP/(TP+FP)

Precision answers the question: How many patients tested +ve are
actually +ve?

This metric is often used in cases where classification of true positives is


a priority. For example, a spam email classifier would rather classify
some spam emails as regular emails rather than classify some regular
emails as spam. That’s why some spam emails end up in your main
inbox, just to be safe. (Here true positives are the spam emails)

SENSITIVITY (RECALL)

Sensitivity is the ratio of correctly identified +ve subjects by test against


all +ve subjects in reality.

Sensitivity = (true positives / all actual positives)= TP/(TP+FN)

Sensitivity answers the question: Of all the patients that are +ve, how
many did the test correctly predict?

This metric is often used in cases where classification of false negatives


is a priority. A good example is the medical test that we used for
illustration above. The government would rather have some healthy
people labeled +ve than have an infected individual labeled -ve and
spread the disease. We would rather be overly cautious and have false
positives than risk wrongly identifying false negatives.

SPECIFICITY

Specificity is the ratio of correctly identified -ve subjects by test against


all -ve subjects in reality.

Specificity = (true negatives / all actual negatives) = TN/(TN+FP)

Specificity answers the question: Of all the patients that are -ve, how
many did the test correctly predict?

This metric is often used in cases where classification of true negatives is


a priority. For example, a doping test will immediately ban an athlete if they test positive. We would not want any drug-free athlete to be wrongly classified and banned.
F1 SCORE

F1 Score accounts for both precision and sensitivity.

F1 Score = 2 * (Recall * Precision)/(Recall + Precision)

It is often considered a better indicator of a classifier’s performance than


a regular accuracy measure as it compensates for uneven class
distribution in the training dataset. For example, an uneven class
distribution is likely to occur in insurance fraud detection, where a large
majority of claims are legitimate and only a very small minority are
fraudulent.

Which metric to use depends on the problem at hand; a sketch of computing these metrics with scikit-learn is shown below.
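A minimal sketch computing these metrics with scikit-learn, using small hypothetical label vectors:

# Sketch: confusion matrix and derived metrics for hypothetical predictions.
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # actual labels (1 = positive)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # model predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, FP, TN, FN:", tp, fp, tn, fn)
print("accuracy   :", accuracy_score(y_true, y_pred))    # (TP+TN)/total
print("precision  :", precision_score(y_true, y_pred))   # TP/(TP+FP)
print("recall     :", recall_score(y_true, y_pred))      # TP/(TP+FN)
print("F1 score   :", f1_score(y_true, y_pred))
print("specificity:", tn / (tn + fp))                     # TN/(TN+FP)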

Why do we need confusion matrix?


We can not rely on a single value of accuracy in classification when the
classes are imbalanced.
For example, we have a dataset of 100 patients in which 5 have diabetes
and 95 are healthy. However, if our model only predicts the majority class i.e.
all 100 people are healthy then also we will have a classification accuracy of
95%.
Confusion matrices are used to visualize important predictive analytics like
recall, specificity, accuracy, and precision.
Confusion matrices are useful because they give direct comparisons of
values like True Positives, False Positives, True Negatives and False
Negatives.

Explain collinearity and techniques to reduce it
In statistics collinearity or multicollinearity is the phenomenon where one or
more predictive variables(features) in multiple regression models are highly
linearly related to each other.

Techniques to reduce multicollinearity

Remove highly correlated predictors from the model. If you have two or more factors with high collinearity, remove one from the model. Because they supply redundant information, removing one of the correlated factors usually doesn't drastically reduce the R-squared. Consider using stepwise regression, best subsets regression, or specialized knowledge of the data set to remove these variables. Select the model that has the highest R-squared value.
Principal Components Analysis (PCA) regression methods cut the number of predictors down to a smaller set of uncorrelated components.
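A minimal sketch for spotting collinear features, assuming df is a hypothetical pandas DataFrame of numeric predictors (statsmodels is used for the VIF calculation):

# Sketch: flag highly correlated predictor pairs and compute variance inflation factors.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def collinearity_report(df, corr_threshold=0.9):
    corr = df.corr().abs()
    # pairs of predictors whose absolute correlation exceeds the threshold
    pairs = [(a, b, round(corr.loc[a, b], 2))
             for i, a in enumerate(df.columns)
             for b in df.columns[i + 1:]
             if corr.loc[a, b] > corr_threshold]
    exog = sm.add_constant(df)    # VIF is usually computed with an intercept term
    # VIF above roughly 5-10 is a common rule of thumb for problematic multicollinearity
    vif = pd.Series([variance_inflation_factor(exog.values, i + 1)
                     for i in range(df.shape[1])], index=df.columns)
    return pairs, vif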

Difference between statistics and machine learning
The major difference between machine learning and statistics is their
purpose. Machine learning models are designed to make the most accurate
predictions possible. Statistical models are designed for inference about the
relationships between variables.
Statistics is the mathematical study of data. Many statistical models can make predictions, but predictive accuracy is not their strength.

In a test, students in section A scored with a mean of 75 and standard deviation of 10, while students in section B scored with a mean of 80 and standard deviation of 12. Melissa from section A and Ryan from section B both scored 90 in this test. Who had a better performance in this test as compared to their classmates?
To compare the two scores we need to standardize them to the same scale. We
do that by calculating the Z score, which allows us to compare the 2 scores in
units of standard deviations.

Z score= (X- mean)/Standard Deviation

Melissa's Z score = (90-75)/10 = 1.5

Ryan's Z score = (90-80)/12 = 0.83

Melissa has performed better.

What is null hypothesis and alternate
hypothesis?
The null hypothesis states that a population parameter (such as the mean,
the standard deviation, and so on) is equal to a hypothesized value. The null
hypothesis is often an initial claim that is based on previous analyses or
specialized knowledge.
The alternative hypothesis states that a population parameter is smaller,
greater, or different than the hypothesized value in the null hypothesis. The
alternative hypothesis is what you might believe to be true or hope to prove
true.
So when running a hypothesis test/experiment, the null hypothesis says that
there is no difference or no change between the two tests. The alternate
hypothesis is the opposite of the null hypothesis and states that there is a
difference between the two tests.

What is a hypothesis test and p-value?


A hypothesis test examines two opposing hypotheses about a population:
the null hypothesis and the alternative hypothesis. The null hypothesis is the
statement being tested. Usually the null hypothesis is a statement of "no
effect" or "no difference". The alternative hypothesis is the statement you
want to be able to conclude is true based on evidence provided by the
sample data.
Based on the sample data, the test determines whether to reject the null
hypothesis. You use a p-value, to make the determination. If the p-value is
less than the significance level (denoted as alpha), then you can reject the
null hypothesis.
In layman's terms, the p-value is the probability of observing a result at least as extreme as the one measured, assuming the null hypothesis is true.
Consider the example where we are trying to test whether a new marketing campaign generates more revenue.
Here the null hypothesis states that there is no change in the revenue as a result of the new marketing campaign.
Based on the p-value we can accept or reject the null hypothesis. A p-value of 0.25 means that, if the campaign truly had no effect on revenue, there would be a 25% chance of seeing a result at least this extreme.
The lower the p-value, the stronger the evidence against the null hypothesis, which, in this case, means that the new marketing campaign causes an increase or decrease in revenue.

In most fields, acceptable p-values should be under 0.05 while in other
fields a p-value of under 0.01 is required.
So when a result has a p-value of 0.05 or lower we can reject the null hypothesis in favour of the alternate hypothesis (a quick sketch of such a test is shown below).
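A minimal sketch of a two-sample test using SciPy, with hypothetical revenue samples from before and after the campaign:

# Sketch: two-sample t-test on hypothetical daily revenue figures.
from scipy import stats

revenue_before = [120, 115, 130, 118, 125, 122, 119]   # hypothetical control data
revenue_after  = [131, 128, 140, 129, 135, 138, 127]   # hypothetical campaign data

t_stat, p_value = stats.ttest_ind(revenue_after, revenue_before, equal_var=False)
print(t_stat, p_value)   # reject the null hypothesis if p_value < 0.05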

More Info: Basic Concepts of Hypothesis Testing
In simple terms, a hypothesis is an assumption. Since it is an assumption, it may or may not hold true after our testing.
The default assumption, which we keep unless there is evidence against it, is termed the 'null hypothesis'.
The claim against the default assumption, which we accept only when the evidence supports it, is termed the 'alternate hypothesis'.
A Type I error occurs when we reject the null hypothesis even though it is actually true.
p-value: the calculated probability of making a Type I error when the null hypothesis is true.
Reference: https://www.youtube.com/watch?v=d0eVIUyt_Uc

What is the power of a hypothesis test? Why is it important?
Remember that if the actual value is positive and our model predicts it as negative, then a Type II error occurs (false negative), e.g. calling a guilty person innocent, or diagnosing a person with cancer as healthy.
The probability of not committing a Type II error is called the power of a hypothesis test. The higher the probability of not committing a Type II error, the better our hypothesis test is.

What is the difference between K nearest neighbors and K-means?
KNN or K nearest neighbors is a classification algorithm, while K-Means is a clustering technique.
KNN is a supervised algorithm; K-Means is an unsupervised algorithm.
In KNN the prediction for a test sample is based on the similarity of its features to its neighbors. The similarity is computed based on a measure such as Euclidean distance. Here K refers to the number of neighbors with whom similarity is compared.
K-Means is the process of defining clusters or groups around centroids based on the similarity of the data points to each other. Here K refers to the number of centroids around which clusters will be formed.

Explain Random forest algorithm


Random forest is a supervised learning algorithm and can be used to solve classification and regression problems.
Since a decision tree creates only one tree to fit the dataset, it may overfit and the model may not generalize well. Unlike a decision tree, a random forest fits multiple decision trees on various sub-samples of the dataset and makes predictions by averaging the predictions from each tree.
Averaging the results from multiple decision trees helps to control the overfitting and results in much better prediction accuracy. As you may have noticed, since this algorithm uses multiple trees, hence the name 'Random Forest'.
Reference: Random Forest
This algorithm is heavily used in various industries such as Banking and e-
commerce to predict behavior and outcomes.

Can Random Forest Algorithm be used both for Continuous and Categorical Target Variables?
Yes, Random Forest can be used for both continuous and categorical target
(dependent) variables.
In a random forest the classification model refers to the categorical
dependent variable, and the regression model refers to the numeric or
continuous dependent variable.

What do you mean by Bagging?

In bagging we build independent estimators on different samples of the
original data set and average or vote across all the predictions.
Bagging is a short form of Bootstrap Aggregating. It is an ensemble
learning approach used to improve the stability and accuracy of machine
learning algorithms.
Since multiple model predictions are averaged together to form the final
predictions, Bagging reduces variance and helps to avoid overfitting.
Although it is usually applied to decision tree methods, it can be used with
any type of method.
Bagging is a special case of the model averaging approach, in case of
regression problem we take mean of the output and in case of classification
we take the majority vote.
Bagging is more helpful if we have overfitting (high variance) base models.
We can also build independent estimators of the same type on each subset. These independent estimators also enable us to process in parallel and increase the speed.
The most popular bagging estimator is 'Bagged Trees', also known as 'Random Forest'.

Bootstrapping
It is a resampling technique, where large numbers of smaller samples of the
same size are repeatedly drawn, with replacement, from a single original
sample.
So this technique enables us to produce as many subsamples as we require from the original training data.
The definition is simple to understand, but the word "replacement" may be confusing sometimes. Here 'replacement' signifies that the same observation may repeat more than once in a given sample, and hence this technique is also known as sampling with replacement.

As you can see in above image we have training data with observations from
X1 to X10. In first bootstrap training sample X6, X10 and X2 are repeated
where as in second training sample X3, X4, X7 and X9 are repeated.
Bootstrap sampling helps us to generate a random sample from the given training data for each model in order to generalise the final estimation.
So in the case of Bagging we create a number of bootstrap samples from the given data to train our base models. Each sample will contain training and test data sets which are different from each other, and remember that a training sample may contain duplicate observations.

What is Out-of-Bag Error in Random Forests?
The Out-of-Bag score is equivalent to a validation or test score, but it is calculated internally by the Random Forest algorithm using the samples each tree did not see during training. In the case of sklearn, if we set the hyperparameter 'oob_score=True' then the Out-of-Bag score will be calculated (see the sketch below).
Finally, we aggregate the errors from all the decision trees to determine the overall OOB error rate for the classification.
For more details refer. https://towardsdatascience.com/what-is-out-of-bag-
oob-score-in-random-forest-a7fa23d710
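A minimal sketch of reading the OOB score in scikit-learn, on stand-in data:

# Sketch: enabling the out-of-bag estimate in a random forest.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=0)   # stand-in data

rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)
print(rf.oob_score_)   # accuracy estimated on samples each tree never saw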

What is the use of the proximity matrix in the random forest algorithm?
A proximity matrix is used for the following cases :

Missing value imputation


Detection of outliers

List down the parameters used to fine-tune the Random Forest.
Two important parameters that have to be fine-tuned to improve the predictions of the random forest algorithm are as follows:

Number of trees used in the forest (n_tree)


Number of random variables used in each of the trees in the forest (mtry)

What is K Fold cross validation? Why do you use it?
In case of K Fold cross validation input data is divided into ‘K’ number of
folds, hence the name K Fold. Suppose we have divided data into 5 folds i.e.
K=5. Now we have 5 sets of data to train and test our model. So the model
will get trained and tested 5 times, but for every iteration we will use one
fold as test data and rest all as training data. Note that for every iteration,
data in training and test fold changes which adds to the effectiveness of this
method.
This significantly reduces underfitting as we are using most of the data for
training(fitting), and also significantly reduces overfitting as most of the data
is also being used in validation set.
K Fold cross validation helps to generalize the machine learning model, which results in better predictions on unknown data (a minimal sketch is shown below).
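A minimal sketch with scikit-learn, using stand-in data and a logistic regression model:

# Sketch: 5-fold cross validation of a classifier.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=500, random_state=0)   # stand-in data

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores, scores.mean())   # one score per fold, plus the average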
Reference: K Fold Cross Validation
How to handle missing data?
Data can be missing because of manual error or can be genuinely missing.

Delete low quality records which have too much missing data.
Impute the values with an educated guess, the average, or a regression model.
Use domain knowledge to impute values.
A minimal sketch of these options is shown below.
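A minimal sketch using pandas and scikit-learn, where df is a hypothetical DataFrame with a numeric column 'amount' containing missing values:

# Sketch: common ways of dealing with missing values.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"amount": [10.0, np.nan, 7.5, np.nan, 12.0]})   # stand-in data

dropped = df.dropna()                                   # option 1: drop incomplete rows
filled = df.fillna({"amount": df["amount"].median()})   # option 2: impute with the median
imputer = SimpleImputer(strategy="mean")                # option 3: sklearn imputer
imputed = imputer.fit_transform(df[["amount"]])
print(dropped, filled, imputed, sep="\n")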

What is the difference between a bar graph and a histogram?
A bar graph is used for discrete (categorical) data, whereas a histogram is used for continuous data.
In a bar graph there is space between the bars; in a histogram there is no space between the bars (continuous scale).
In a bar graph the order of the bars can be changed; in a histogram the order stays the same.

What is the Box and Whisker plot? When should you use it?
Box and whisker plots are ideal for comparing distributions because the
centre, spread and overall range are immediately apparent.

A box and whisker plot is a way of summarizing a set of data measured on an


interval scale.

It is often used in exploratory data analysis.

Boxplots are a standardized way of displaying the distribution of data based


on a five number summary (“minimum”, first quartile (Q1), median, third
quartile (Q3), and “maximum”).

median (Q2/50th Percentile): the middle value of the dataset.


first quartile (Q1/25th Percentile): the middle number between the
smallest number (not the “minimum”) and the median of the dataset.
third quartile (Q3/75th Percentile): the middle value between the median
and the highest value (not the “maximum”) of the dataset.
interquartile range (IQR): 25th to the 75th percentile.
whiskers (shown in blue)
outliers (shown as green circles)
“maximum”: Q3 + 1.5*IQR
“minimum”: Q1 -1.5*IQR

Ref. https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51

What is an outlier? How do you handle them?


An outlier is an observation that lies an abnormal distance from other values
in a random sample from a population.
Data points above and below 1.5*IQR, are most commonly outliers.

Outliers can drastically change the results of the data analysis and statistical
modeling.

Types of outliers

Data entry errors
Measurement errors
Intentional outliers. These are commonly found in self-reported measures that involve sensitive data. For example, teens typically under-report the amount of alcohol they consume.
Data processing errors. Whenever we perform data mining, we extract data from multiple sources. It is possible that some manipulation or extraction errors lead to outliers in the dataset.
Sampling errors. For instance, we have to measure the height of athletes and by mistake include a few basketball players in the sample. This inclusion is likely to cause outliers in the dataset.
Natural outliers. When an outlier is not artificial (due to error), it is a natural outlier. For instance, in a project with a renowned insurance company, the performance of the top 50 financial advisors was far higher than the rest of the population. Surprisingly, it was not due to any error, so whenever we performed any data mining activity with advisors, we treated this segment separately.

How to detect Outliers?


The most commonly used method to detect outliers is visualization.

We use various visualization methods, like box plots, histograms and scatter plots.

Use capping methods: any value outside the range of the 5th and 95th percentiles can be considered an outlier.
Data points three or more standard deviations away from the mean are considered outliers.

Apart from visualization we can also use the Z-score or Extreme Value Analysis (parametric) to detect outliers. A minimal sketch of the IQR and Z-score rules is shown below.
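A minimal sketch of the IQR and Z-score rules on a stand-in array of values:

# Sketch: flagging outliers with the 1.5*IQR rule and the 3-sigma (Z-score) rule.
import numpy as np

rng = np.random.default_rng(0)
values = np.append(rng.normal(loc=10, scale=1, size=50), 95)   # stand-in data + one outlier

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
iqr_outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

z_scores = (values - values.mean()) / values.std()
z_outliers = values[np.abs(z_scores) >= 3]

print(iqr_outliers, z_outliers)   # both rules should flag the injected extreme value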

How to remove outliers?


Most of the methods used to handle missing values are also applicable to outliers.

Deleting observations
We delete outlier values if they are due to data entry errors or data processing errors, or if the outlier observations are very few in number. We can also trim at both ends to remove outliers.

Transforming and binning values


Transforming variables can also eliminate outliers. Taking the natural log of a value reduces the variation caused by extreme values. Binning is also a form of variable transformation. The Decision Tree algorithm deals with outliers well because it bins the variables.

Imputing
We can use mean, median, mode imputation methods.

Treat separately
If there are a significant number of outliers, we should treat them separately in the statistical model. One approach is to treat the outliers and the rest of the data as two different groups, build an individual model for each group and then combine the output.

Reference: https://www.linkedin.com/pulse/techniques-outlier-detection-
treatment-suhas-jk/

If deleting outliers is not an option, how will you handle them?
I will try different models. Data detected as outliers by a linear model can be fit by a non-linear model.
Try normalizing the data; this way the extreme data points are pulled into a similar range.
We can use algorithms which are less affected by outliers.
We can also create a separate model to handle the outlier data points.

You fit two linear models on a dataset. Model 1 has 25 predictors and model 2 has 10 predictors. What performance metric would you use to select the best model based on the training dataset?
First of all, model performance is not directly proportional to the number of predictors, so we can't say that the model with 25 predictors is better than the model with 10 predictors.

The important thing here is to understand the different evaluation metrics for linear regression and which one of them can help us identify the impact of the number of predictors on model performance.

The evaluation metrics used for linear regression are MSE, MAE, R-squared, Adjusted R-squared, and RMSE.

MSE penalizes large errors, MAE does not penalize large errors, RMSE penalizes large errors, and R-squared (the Coefficient of Determination) represents the strength of the relationship between your model and the dependent variable.

Though R-squared represents the strength of the relationship between the model and the dependent variable, it should not be used for comparing these models, as the value of R² increases with the number of predictors (even if those predictors do not add any value to the model).

The only remaining metric is Adjusted R-squared. Unlike R-squared, Adjusted R-squared measures the variation explained by only the independent variables that actually affect the dependent variable.

So the Adjusted R-squared score will increase only if an added predictor improves the model's performance significantly; otherwise it will decrease. Hence the correct answer is Adjusted R-squared.
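For reference, a common form of adjusted R-squared, with n observations and p predictors, is:
Adjusted R² = 1 − (1 − R²) × (n − 1) / (n − p − 1)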

Reference: https://www.youtube.com/watch?
v=lRAgottY8XU&list=PLjW9PIyfCennBOprV3CPoqMX8SW-qNlUa

Suppose we have a function -4x^2 + 4x + 3. Find the maximum or minimum of this function.
This is a quadratic equation, f(x) = -4x^2 + 4x + 3 (for a function ax^2 + bx + c, when a < 0 the function has a maximum value).

To find the slope of the function, let's take its derivative:

f'(x) = -8x + 4

At the maximum point, the slope will be 0:

-8x + 4 = 0

x = 0.5

Now let's put 0.5 into the equation to find the maximum value:

f(0.5) = -4(0.5)^2 + 4(0.5) + 3 = -1 + 2 + 3 = 4

This function has a concave shape, so the maximum point is (0.5, 4).

Reference: https://www.youtube.com/watch?
v=lRAgottY8XU&list=PLjW9PIyfCennBOprV3CPoqMX8SW-qNlUa

Below is the output of a correlation matrix from your exploratory data analysis. Is using all the features in a model appropriate for predicting/inferencing Y?

We can see from the above correlation matrix that there is a high correlation (0.98) between X1 and X2, a high correlation (0.88) between X1 and X3, and similarly a high correlation (0.75) between X2 and X3.
All the variables are correlated with each other. In regression this would result in multicollinearity. We can try methods such as dimension reduction, feature selection or stepwise regression to choose the correct input variables for predicting Y.
The second part of the question is: should we use all the variables for modeling?
Using multicollinear features in modeling doesn't help. We should remove the multicollinear features and keep the unique ones, so that explaining the model predictions also becomes easy.
It will also make the model less complex, and we don't have to store as many features.

Prediction vs Inference
Inference and prediction are two often confused terms, perhaps in part because
they are not mutually exclusive.

Reference: https://www.datasciencecentral.com/profiles/blogs/inference-vs-
prediction-in-one-picture

What is stepwise regression?


Stepwise regression is a method of fitting regression models in which the choice
of predictive variables is carried out by an automatic procedure. In each step, a
variable is considered for addition to or subtraction from the set of explanatory
variables based on some prespecified criterion.

Stepwise regression is classified into backward and forward selection.

Backward selection starts with a full model, then step by step we reduce
the regressor variables and find the model with the least RSS, largest R², or
the least MSE. The variables to drop would be the ones with high p-values.
Forward selection starts with a null model, then step by step we add regressor variables until we can no longer improve the error performance of the model. We usually pick the model with the highest adjusted R². A minimal sketch of both directions is shown below.
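A minimal sketch of automated forward/backward selection using scikit-learn's SequentialFeatureSelector (a greedy, cross-validation-based variant of stepwise selection rather than the classical p-value procedure), on stand-in data:

# Sketch: greedy forward and backward feature selection.
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=10, n_informative=4, random_state=0)

for direction in ("forward", "backward"):
    selector = SequentialFeatureSelector(LinearRegression(),
                                         n_features_to_select=4,
                                         direction=direction, cv=5)
    selector.fit(X, y)
    print(direction, selector.get_support(indices=True))   # indices of chosen features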

You have two buckets - one of 3 litres and the other of 5 litres. You are expected to measure exactly 4 litres. How will you complete the task? Note: There is no third bucket.
Questions like this will test your out of the box thinking
Step 1: Fill the 5 litre bucket and empty it into the 3 litre bucket. Now we are left with 2 litres in the 5 litre bucket.
Step 2: Empty the 3 litre bucket and pour the contents of the 5 litre bucket into the 3 litre bucket. Now our 5 litre bucket is empty and the 3 litre bucket has 2 litres in it.
Now fill the 5 litre bucket again. Remember that our 3 litre bucket has 2 litres in it, so if we pour 1 litre from the 5 litre bucket into the 3 litre bucket, we are left with 4 litres in the 5 litre bucket.
Reference: https://youtu.be/5JZsSNLXXuE

List the differences between supervised and unsupervised learning
| Supervised learning | Unsupervised learning |
| --- | --- |
| Uses labeled data as input | Uses unlabeled data as input |
| Supervised learning has a feedback mechanism | Unsupervised learning has no feedback mechanism |
| Common supervised learning algorithms are decision tree, logistic regression, support vector machine etc. | Common unsupervised learning algorithms are K-Means clustering, hierarchical clustering etc. |

Reference: https://youtu.be/5JZsSNLXXuE

Explain the steps in making a decision tree

Below are the common steps in decision tree algorithm

Take the entire dataset as input.
At the root node the decision tree selects a feature to split the data into two major categories.
Different criteria can be used to split the data. We generally use 'entropy' or 'gini' for classification and 'mse' or 'mae' for regression problems.
Features are selected for splitting based on the highest information gain.
After every split we get decision rules and sub-trees.
This process continues until every training example is grouped together or the maximum allowed tree depth is reached.
So at the end of the decision tree we end up with leaf nodes, which represent the class or continuous value that we are trying to predict. A minimal sketch is shown below.
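A minimal sketch of fitting and inspecting such a tree with scikit-learn, on stand-in data:

# Sketch: training a small decision tree and printing its learned split rules.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X, y)
print(export_text(tree))   # the conditions chosen at each node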
Reference: https://satishgunjal.com/decision_tree/

How do you build a random forest model?
A random forest is made up of multiple decision trees. Unlike a decision tree, a random forest fits multiple decision trees on various sub-samples of the dataset and makes predictions by aggregating the predictions from each tree.

Select a few random sub-samples from the given dataset.
Construct a decision tree for every sub-sample and predict the result.
Perform voting on the predictions from each tree.
At the end, select the most voted result as the final prediction.
Reference: https://satishgunjal.com/random_forest/

How does Random Forest handle missing data?
Note that handling missing data is one of the advantages of the Random Forest algorithm over a decision tree. Please refer to the below diagram where we have a training data set of circles, squares and triangles of color red, green and blue respectively. There are 27 training examples in total.

The random forest will create three sub-samples of 9 training examples each.

The random forest algorithm will create a different decision tree for each sub-sample.

Notice that each tree uses different criteria to split the data.

Now it is a straightforward analysis for the algorithm to predict the shape of a given figure if its shape and color are known. Let's check the predictions of each tree for a blue colored triangle (here the shape input is missing):
Tree 1 will predict: triangle
Tree 2 will predict: square
Tree 3 will predict: triangle
Since the majority of votes is for triangle, the final prediction is 'triangle shape'.
Now, let's check the predictions for a circle with no color defined (the color attribute is missing here):

Tree 1 will predict: triangle
Tree 2 will predict: circle
Tree 3 will predict: circle
Since the majority of votes is for circle, the final prediction is 'circle shape'.

Please note this is an over-simplified example, but it gives an idea of how multiple trees with different split criteria help to handle missing features.

Reference: https://satishgunjal.com/random_forest/

What is model overfitting? How can you avoid it?
Overfitting occurs when your model learns too much from training data and isn't
able to generalize the underlying information. When this happens, the model is
able to describe training data very accurately but loses precision on every
dataset it has not been trained on. Below images represent the overfitting linear
and logistic regression models.

How To Avoid Overfitting?

Since an overfitting algorithm captures the noise in the data, reducing the number of features will help. We can manually select only important features or can use a model selection algorithm for the same.
We can also use the 'Regularization' technique. It works well when we have lots of slightly useful features. Sklearn linear models (Ridge and LASSO) use the regularization parameter 'alpha' to control the size of the coefficients by imposing a penalty.
K-fold cross validation. In this technique we divide the training data into multiple folds and use each fold for training and testing the model.
Increasing the training data also helps to avoid overfitting.

Reference: https://satishgunjal.com/underfitting_overfitting/

There are 9 balls out of which one ball is heavier and the rest are of the same weight. In how many minimum weighings will you find the heavier ball?
You will need two weighings.

Step 1: Out of the 9 balls, place three balls on each side of the scale (you will have three remaining balls).

Scenario: Balanced

If the scale balances, the heavier ball is definitely among the three remaining balls. Out of those three balls, take two and place one on each side. If they balance, the left-out ball is the heavier ball; otherwise, you will see it on the scale.

Scenario: Not balanced

If the balls in step 1 do not balance, the heavier side contains the heavier ball. Repeat the same second weighing with the three balls from the heavier side to find it.

Reference: https://www.youtube.com/watch?
v=5JZsSNLXXuE&list=PLwWVLyefnzgpWxe2WEPrmHqHzwHlyZw1U&index=2&t=

Difference between univariate, bivariate and multivariate analysis?
Univariate Analysis
Analysis of a single variable at a time, e.g. summarizing one feature with descriptive statistics or a histogram.

Bivariate Analysis
Analysis of the relationship between two variables, e.g. a scatter plot or correlation between X and Y.

Multivariate Analysis
Analysis involving more than two variables at the same time, e.g. multiple regression.

What are the feature selection methods to select the right variables?
Feature selection is the process of reducing the number of input variables when developing a predictive model. There are two groups of methods for feature selection: filter methods and wrapper methods. The best analogy for selecting features is: bad data in, bad answers out.

Filter Methods

Filter feature selection methods use statistical techniques to evaluate the
relationship between each input variable and the target variable, and these
scores are used as the basis to choose (filter) those input variables that will
be used in the model.
These methods are faster and less computationally expensive than wrapper
methods.

Information Gain
Information gain calculates the reduction in entropy from the transformation of a
dataset. It can be used for feature selection by evaluating the Information gain of
each variable in the context of the target variable.

Chi-square Test
The Chi-square test is used for categorical features in a dataset. We calculate
Chi-square between each feature and the target and select the desired number
of features with the best Chi-square scores.
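A minimal sketch of the chi-square filter with scikit-learn, on stand-in non-negative features:

# Sketch: keeping the k best features according to the chi-square statistic.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)          # all features are non-negative, as chi2 requires

selector = SelectKBest(score_func=chi2, k=2)
X_new = selector.fit_transform(X, y)
print(selector.scores_, X_new.shape)       # per-feature scores and the reduced matrix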

Correlation Coefficient
Correlation is a measure of the linear relationship of 2 or more variables.
Through correlation, we can predict one variable from the other. The logic behind
using correlation for feature selection is that the good variables are highly
correlated with the target. Furthermore, variables should be correlated with the
target but should be uncorrelated among themselves.

Wrapper Methods
Wrapper feature selection methods create many models with different
subsets of input features and select those features that result in the best
performing model according to a performance metric.
These methods are unconcerned with the variable types, although they can
be computationally expensive.
The wrapper methods usually result in better predictive accuracy than filter
methods.

Forward Feature Selection


This is an iterative method wherein we start with the best performing variable
against the target. Next, we select another variable that gives the best
performance in combination with the first selected variable. This process
continues until the preset criterion is achieved.

Backward Feature Elimination


This method works exactly opposite to the Forward Feature Selection method.
Here, we start with all the features available and build a model. Next, we remove
the variable from the model which gives the best evaluation measure value. This
process is continued until the preset criterion is achieved.

Exhaustive Feature Selection


This is the most robust feature selection method covered so far. This is a brute-
force evaluation of each feature subset. This means that it tries every possible
combination of the variables and returns the best performing subset.

Reference: https://www.analyticsvidhya.com/blog/2020/10/feature-selection-
techniques-in-machine-learning/

In your choice of language: Write a program that prints the numbers from 1 to 50, but for multiples of three print "Fizz" instead of the number, for multiples of five print "Buzz", and for numbers which are multiples of both three and five print "FizzBuzz".
In [1]: # The continue statement skips the remaining statements in the current iteration
# and moves control back to the top of the loop.
for i in range(1, 51):
    if i % 3 == 0 and i % 5 == 0:
        print("FizzBuzz")
        continue
    if i % 3 == 0:
        print("Fizz")
        continue
    if i % 5 == 0:
        print("Buzz")
        continue
    print(i)

1
2
Fizz
4
Buzz
Fizz
7
8
Fizz
Buzz
11
Fizz
13
14
FizzBuzz
16
17
Fizz
19
Buzz
Fizz
22
23
Fizz
Buzz
26
Fizz
28
29
FizzBuzz
31
32
Fizz
34
Buzz
Fizz
37
38
Fizz
Buzz
41
Fizz
43
44
FizzBuzz
46
47
Fizz
49
Buzz

You are given a dataset consisting of variables having more than 30% missing values. How will you deal with them?
There are multiple ways to handle missing values in the data:
If the dataset is huge we can simply remove the rows containing the missing data.
If the dataset is small then we have to impute the missing values. There are multiple ways to impute them: in the case of categorical data we may use the most common value, and in the case of numerical data we can use the mean, median etc.
Reference: https://youtu.be/5JZsSNLXXuE

For the given points, how will you calculate the Euclidean distance in Python?
Euclidean distance is calculated as the square root of the sum of the squared
differences between the two vectors.

Reference:
https://predictivehacks.com/tip-how-to-define-your-distance-function-for-
hierarchical-clustering/

In [2]: import math

# define the points
p1 = [6, 5]
p2 = [3, 2]

euclidean_distance = math.sqrt((p1[0]-p2[0])**2 + (p1[1]-p2[1])**2)
print(euclidean_distance)

4.242640687119285

What is the angle between the hour and minute hands of a clock when the time is half past six?
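A quick worked answer: at 6:30 the minute hand points at the 6, i.e. 180°. The hour hand moves 0.5° per minute, so at 6:30 it is at 6 × 30° + 30 × 0.5° = 195°. The angle between the hands is therefore 195° − 180° = 15°.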

Reference: https://youtu.be/5JZsSNLXXuE

How should you maintain your deployed model?
Monitor
Constant monitoring of all the models is needed to determine the performance
accuracy of the models

Evaluate
Evaluation metric of the current model is calculated to determine if new
algorithm is needed.

Compare
The new models are compared against each other to determine which model
performs the best.

Rebuild
The best performing model is re-built on the current set of data.
Reference: https://youtu.be/5JZsSNLXXuE

What are recommender systems?


The purpose of a recommender system is to suggest relevant items or
services to users.
The two major categories of recommender systems are collaborative filtering and content based filtering methods.

Collaborative Filtering
It is based on the past interactions recorded between users and items in order to produce new recommendations.
e.g. A music service recommends tracks that are often played by other users with similar interests.

Content Based Filtering


Unlike collaborative methods that only rely on the user-item interactions,
content based approaches use additional information about the content
consumed by the user to produce new recommendations
e.g. A music service recommends new songs based on properties of the songs the user listens to.

Reference: https://youtu.be/5JZsSNLXXuE

'People who bought this, also bought...' recommendations seen on Amazon are a result of which algorithm?
It's done by a recommendation system using the collaborative filtering approach. In collaborative filtering, past interactions recorded between users and items are used to produce new recommendations.

If it rains on Saturday with probability 0.6, and it rains on Sunday with probability 0.2, what is the probability that it rains this weekend?
Since we know the probability of rain on Saturday and Sunday, the probability of rain over the weekend is a combination of both of these events (assuming the two days are independent).
The trick here is to use the probability of it not raining on Saturday and Sunday. If we subtract the probability of the intersection (∩) of both no-rain events from the total probability, we get the probability of rain on the weekend.

= Total probability - (Probability that it will not rain on Saturday) × (Probability that it will not rain on Sunday)
= 1 - (1 - 0.6)*(1 - 0.2)
= 0.68

Reference: https://youtu.be/5JZsSNLXXuE

How can you select K for K-Means?


There are two common ways to select the number of clusters in the case of the K-Means clustering algorithm.

Visualization
Finding the number of clusters manually by data visualization is one of the most common methods.
Domain knowledge and a proper understanding of the given data also help to make more informed decisions.
Since it is a manual exercise there is always scope for ambiguous observations; in such cases we can also use the 'Elbow Method'.

Elbow Method
In the Elbow method we run the K-Means algorithm multiple times over a loop, with an increasing number of clusters (say from 1 to 10), and then plot the clustering score as a function of the number of clusters (see the sketch below).
The clustering score is nothing but the sum of squared distances of samples to their closest cluster center.
The elbow is the point on the plot where the clustering score (distortion) stops decreasing rapidly, and the number of clusters at that point gives us the optimum number of clusters.
But sometimes we don't get a clear elbow point on the plot; in such cases it is very hard to finalize the number of clusters.
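A minimal sketch of the elbow computation with scikit-learn, on stand-in data:

# Sketch: computing the clustering score (inertia) for a range of K values.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)   # stand-in data

for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))   # plot these values and look for the elbow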

Reference: https://satishgunjal.com/kmeans/#5

Explain dimensionality reduction and
its benefits
Dimensionality reduction refers to the process of converting a data set
with a vast number of dimensions into one with fewer dimensions (features) that
still conveys similar information concisely.
It helps in compressing data and reducing the required storage space.
It reduces computation time, as fewer dimensions lead to less computing.
It removes redundant features, e.g. there is no point in storing the same value
in two different units.
Reference: https://youtu.be/5JZsSNLXXuE

How can you say that the time series
data is stationary?
For accurate analysis and forecasting, trend and seasonality are removed from the
time series to convert it into a stationary series. Time series data is said to be
stationary when statistical properties like the mean and standard deviation are
constant and there is no seasonality. In other words, the statistical properties of
the time series should not be a function of time.
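A common practical check is a unit-root test such as the Augmented Dickey-Fuller test; a minimal sketch with statsmodels, where the toy series (white noise, which is stationary) is an assumption:

# ADF stationarity check (assumes statsmodels, pandas and numpy are installed)
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller

series = pd.Series(np.random.default_rng(0).normal(size=200))  # toy stationary series

result = adfuller(series)
print("ADF statistic:", result[0])
print("p-value:", result[1])
# A small p-value (e.g. < 0.05) rejects the null hypothesis of a unit root,
# i.e. the series can be treated as stationary.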

Reference: https://satishgunjal.com/time_series/

How can you calculate the accuracy
using a confusion matrix?
Accuracy = (True Positive + True Negative) / Total Observations

Write the equations for precision
and recall
Precision = True Positive / (True Positive + False Positive)

Recall = True Positive / (True Positive + False Negative)
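All three metrics can be read directly off a confusion matrix; a minimal sketch with scikit-learn on toy labels:

# Accuracy, precision and recall from a confusion matrix (assumes scikit-learn)
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# confusion_matrix returns [[TN, FP], [FN, TP]] for labels ordered [0, 1]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(accuracy, precision, recall)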

If a drawer contains 12 red socks, 16
blue socks, and 20 white socks, how
many must you pull out to be sure of
having a matching pair?
There are three colors of socks: red, blue and white. The number of socks of each
color is irrelevant here.
Suppose in our first pull we pick a red sock,
in the second pull a blue sock,
and in the third pull a white sock.
Now whatever color we pick in the fourth pull, a match is guaranteed. So the
answer is 4.
Reference: https://youtu.be/5JZsSNLXXuE

Write a SQL query to list all orders with
customer information
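The question does not give a schema, so the table and column names below (orders, customers, customer_id and so on) are assumptions; a minimal sketch that joins the two tables, run here through sqlite3 and pandas:

# Join orders to customers (assumed schema; shop.db is a hypothetical database file)
# orders(order_id, customer_id, order_date, amount)
# customers(customer_id, name, email)
import sqlite3
import pandas as pd

conn = sqlite3.connect("shop.db")

query = """
SELECT o.order_id,
       o.order_date,
       o.amount,
       c.customer_id,
       c.name,
       c.email
FROM orders AS o
JOIN customers AS c
  ON o.customer_id = c.customer_id;
"""

orders_with_customers = pd.read_sql_query(query, conn)
print(orders_with_customers.head())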
Which of the following machine
learning algorithms can be used for
imputing missing values of both
categorical and continuous variables?
- K-means clustering
- Linear regression
- K-NN
- Decision trees

Answer: K-NN. Using K-NN we can impute a missing value from its nearest
neighbors, e.g. the mean of the neighbors for a continuous variable or the most
frequent category for a categorical one.
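A minimal sketch with scikit-learn's KNNImputer (toy numeric data; categorical columns would need to be encoded, e.g. ordinal-encoded, before imputing):

# K-NN based imputation (assumes scikit-learn is installed)
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)  # each missing entry replaced by the mean of the 2 nearest rows
print(X_imputed)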

Given a box of matches and two ropes,
not necessarily identical, measure a
period of 45 minutes. Note: the ropes do
not burn uniformly, but each rope takes
exactly 60 minutes to completely burn
out.
We have two ropes, A and B.
Light A from both ends and B from one end.
When A finishes burning we know that 30 minutes have elapsed and B has
30 minutes of burn time remaining.
Now light the other end of B as well; it will burn out in 15 minutes.
Thus we get 30 + 15 = 45 minutes.
Reference: https://youtu.be/5JZsSNLXXuE

After studying the behaviour of a
population, you have identified four
specific individual types who are
valuable to your study. You would like to
find all users who are most similar to
each individual type. Which algorithm
is most appropriate for this study?
- K-means clustering
- Linear regression
- Association rules
- Decision trees

Answer: K-means clustering, because the goal is to group users into clusters of
similar individuals around the four identified types.

Reference: https://youtu.be/5JZsSNLXXuE

Your organization has a website where
visitors randomly receive one of the
two coupons. It is also possible that
visitors to the website will not receive
the coupon. You have been asked to
determine if offering a coupon to the
visitors to your website has any impact
on their purchase decision. Which
analysis method should you use?
- One-Way ANOVA
- K-means clustering
- Association rules
- Student's t-test

Answer: One-Way ANOVA, since we are comparing the mean purchase behaviour
across three groups (coupon A, coupon B, and no coupon).

Reference: https://youtu.be/5JZsSNLXXuE

Explain Principal Component
Analysis
Principal Component Analysis (PCA) is a dimensionality reduction method that
is used to reduce the dimensionality of large data sets, by transforming a large
set of variables into a smaller one that still contains most of the information in
the large set.
Principal component analysis is a technique for feature extraction, so it
combines our input variables in a specific way, then we can drop the "least
important" variables while still retaining the most valuable parts of all of
the variables. As an added benefit, each of the "new" variables after PCA is
independent of the others.
Reducing the number of variables of a dataset naturally comes at the
expense of accuracy, but the trick in dimensionality reduction is to trade a
little accuracy for simplicity.
By reducing the dimension of your feature space, you have fewer
relationships between variables to consider and you are less likely to overfit
your model.
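A minimal sketch with scikit-learn; the iris data and the choice of two components are illustrative assumptions, and the data is standardized first because PCA is sensitive to feature scale:

# PCA sketch (assumes scikit-learn is installed)
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)              # keep the 2 directions with the most variance
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                 # (150, 2)
print(pca.explained_variance_ratio_)   # fraction of variance retained per component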

When should I use PCA?


Do you want to reduce the number of variables, but aren’t able to identify
variables to completely remove from consideration?

Do you want to ensure your variables are independent of one another?

Are you comfortable making your independent variables less interpretable?

Reference: https://builtin.com/data-science/step-step-explanation-principal-
component-analysis, https://towardsdatascience.com/a-one-stop-shop-for-
principal-component-analysis-5582fb7e0a9c

Explain feature scaling, normalization
and standardization
Feature scaling is one of the most important data preprocessing steps in
machine learning.
If we are changing the range of the features it is called 'scaling', and if we
are changing the distribution of the features it is called
'normalization/standardization'.

Scaling
This means that you're transforming your data so that it fits within a specific
scale, like 0-100 or 0-1. By scaling your variables, you can help compare
different variables on equal footing.
Scaling is required in case of distance based algorithms like support vector
machines (SVM) or k-nearest neighbors (KNN).
For example, you might be looking at the prices of some products in both Yen
and US Dollars. One US Dollar is worth about 100 Yen, but if you don't scale
your prices, methods like SVM or KNN will consider a difference in price of 1
Yen as important as a difference of 1 US Dollar! This clearly doesn't fit with
our intuitions of the world. With currency, you can convert between
currencies. But what about if you're looking at something like height and
weight? It's not entirely clear how many pounds should equal one inch (or
how many kilograms should equal one meter).
Notice that the shape of the data doesn't change, but that instead of ranging
from 0 to 8ish, it now ranges from 0 to 1. Here we have used min-max
scaling
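A minimal sketch of min-max scaling with scikit-learn; the toy prices are an assumption:

# Min-max scaling to the [0, 1] range (assumes scikit-learn is installed)
import numpy as np
from sklearn.preprocessing import MinMaxScaler

prices = np.array([[100.0], [250.0], [400.0], [1000.0]])  # toy feature

scaler = MinMaxScaler(feature_range=(0, 1))
prices_scaled = scaler.fit_transform(prices)
print(prices_scaled.ravel())  # values now lie between 0 and 1; the shape of the distribution is unchanged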

Normalization
Normalization is a more radical transformation: it changes the shape of the data's
distribution towards a 'normal distribution'.

Normal distribution: also known as the "bell curve", this is a specific
statistical distribution where roughly equal numbers of observations fall above
and below the mean, the mean and the median are the same, and
there are more observations closer to the mean. The normal
distribution is also known as the Gaussian distribution.

Some machine learning algorithms assume that the data is normally
distributed, such as linear discriminant analysis (LDA) and Gaussian naive Bayes.

Notice that the shape of our data has changed. Before normalizing it was
almost L-shaped. But after normalizing it looks more like the outline of a bell
(hence "bell curve"). Here we have used the Box-Cox transformation.

Standardization
Standardization typically means rescaling the data to have a mean of 0 and a
standard deviation of 1 (unit variance). For most applications standardization
is recommended.

Scikit-Learn provides a transformer called StandardScaler for


standardization.
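A minimal sketch on toy data:

# Standardization: zero mean, unit variance (assumes scikit-learn is installed)
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

scaler = StandardScaler()
X_std = scaler.fit_transform(X)
print(X_std.mean(axis=0))  # approximately 0 for each feature
print(X_std.std(axis=0))   # 1 for each feature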

Reference: https://www.kaggle.com/code/alexisbcook/scaling-and-
normalization/tutorial

Difference between standardization
and normalization?
Normalization typically means rescaling the values into a range of [0, 1]. This
might be useful in some cases where all parameters need to be on the same
positive scale. However, the outliers from the data set are lost.
Standardization typically means rescaling the data to have a mean of 0 and a
standard deviation of 1 (unit variance). For most applications standardization
is recommended.

What is meant by Data Leakage?


Data leakage is the scenario where the machine learning model is already
aware of some part of the test data after training. This causes the problem of
overfitting.
In machine learning, data leakage refers to a mistake made by the
creator of a machine learning model in which they accidentally share
information between the test and training data sets.
Data leakage is a serious and widespread problem in data mining and
machine learning which needs to be handled well to obtain a robust and
generalized predictive model.

Examples of data leakage

The most obvious and easy-to-understand cause of data leakage is to include
the target variable as a feature, which defeats the purpose of prediction. This is
likely to be done by mistake, so while building any ML model you have to
make sure that the target variable is kept separate from the set of features.
Another common cause of data leakage is to include test data with training
data.

Above two cases are not very likely to occur because they can easily
be spotted while doing the modelling. Below are few data leakage
examples that are hard to troubleshoot.

Presence of Giveaway features

Say we are working on a problem statement in which we have to build a
model that predicts a certain medical condition. If we have a feature that
indicates whether a patient had a surgery related to that medical
condition, then it causes data leakage and it should never be included
as a feature in the training data. The indication of surgery is highly
predictive of the medical condition and would probably not be available
in all cases. If we already know that a patient had a surgery related to a
medical condition, then we may not even require a predictive model to
start with.
Say we are working on a problem statement in which we have to build a
model that predicts if a user will stay on a website. Including features
that expose information about future visits will cause the problem of
data leakage. So, we have to use only features about the current session,
because information about future sessions is not generally available
after we deploy our model.
Leakage during Data preprocessing

While solving a machine learning problem statement, we first do the
data cleaning and preprocessing, which involves the following steps:

Evaluating the parameters for normalizing or rescaling features
Finding the minimum and maximum values of a particular feature
Normalizing the particular feature in our dataset
Removing the outliers
Filling or completely removing the missing data in our dataset

The above-described steps should be done using only the training set. If
we use the entire dataset to perform these operations, data leakage may
occur.

Applying preprocessing techniques to the entire dataset will cause the
model to learn not only the training set but also the test set. As we
know, the test set should be new and previously unseen for any
model.
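A minimal sketch of leakage-safe preprocessing with a scikit-learn pipeline, where the scaler is fitted on the training split only; the dataset and model choice are illustrative assumptions:

# Leakage-safe preprocessing: fit the scaler on the training split only
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# The pipeline guarantees that fit() only ever sees the training data,
# so no test-set statistics leak into the preprocessing step.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print(model.score(X_test, y_test))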
Reference: https://www.analyticsvidhya.com/blog/2021/07/data-leakage-and-
its-effect-on-the-performance-of-an-ml-
model/#:~:text=How%20does%20it%20exactly%20happen,“leakage”%20instea

How to detect Data Leakage?


Results are too good to be true

In general, if we see that the model we built looks too good to be true
(i.e. the predicted and actual outputs are almost identical), we should get
suspicious, and data leakage cannot be ruled out.
In that case, the model might be somehow memorizing the relations
between the features and the target instead of learning and generalizing
them for unseen data.
So, it is advised that, before testing, the documented prior results are
weighed against the expected results.
Using EDA

While doing the Exploratory Data Analysis (EDA), we may detect features
that are very highly correlated with the target variable. Of course, some
features are more correlated than others but a surprisingly high
correlation needs to be checked and handled carefully.
We should pay close attention to those features. So, with the help of
EDA, we can examine the raw data through statistical and visualization
tools.
High weight features

After the completion of the model training, if features are having very
high weights, then we should pay close attention. Those features might
be leaky.
Reference: https://www.analyticsvidhya.com/blog/2021/07/data-leakage-and-
its-effect-on-the-performance-of-an-ml-
model/#:~:text=How%20does%20it%20exactly%20happen,“leakage”%20instea

How to fix the problem of Data
Leakage?
The main culprit behind this is the way we split our dataset and when. The
following steps can prove to be very crucial in preventing data leakage:

Select the features in such a way that they do not contain information about
the target variable that is not naturally available at the time of prediction.
Create a Separate Validation Set

To minimize or avoid the problem of data leakage, we should try to set


aside a validation set in addition to training and test sets if possible.
The purpose of the validation set is to mimic the real-life scenario and
can be used as a final step.
By doing this type of activity, we will identify if there is any possible case
of overfitting which in turn can act as a caution warning against
deploying models that are expected to underperform in the production
environment.
Apply Data preprocessing Separately to both Train and Test subsets

While dealing with neural networks, it is common practice to
normalize the input data before feeding it into the model.
Normalization relies on statistics (such as the mean) computed from the
data. More often than not, this normalization is applied to the overall
data set, which lets information from the test set influence the training
set and eventually results in data leakage.
Hence, to avoid data leakage, we have to fit any normalization
technique on the training subset only and then apply the fitted transform
to the test subset.
Problem with the Time-Series Type of data

When dealing with time-series data, we should pay more attention to


data leakage. For example, if we somehow use data from the future
when doing computations for current features or predictions, it is highly
likely to end up with a leaked model.
It generally happens when the data is randomly split into train and test
subsets.
So, when working with time-series data, we put a cutoff value on time
which might be very useful, as it prevents us from getting any
information after the time of prediction.
Reference: https://www.analyticsvidhya.com/blog/2021/07/data-leakage-and-
its-effect-on-the-performance-of-an-ml-
model/#:~:text=How%20does%20it%20exactly%20happen,“leakage”%20instea

What is selection bias?


Selection bias is the bias introduced by the selection of individuals, groups,
or data for analysis in such a way that proper randomization is not achieved,
with the result that the sample obtained is not representative of the
population intended to be analyzed. It is sometimes referred to as the
selection effect.
Sampling bias is usually classified as a subtype of selection bias; it is a bias
in which a sample is collected in such a way that some
members of the intended population have a lower or higher sampling
probability than others.
Due to sampling bias, the probability distribution in the collected dataset
deviates from its true natural distribution, which may affect ML models
performance.

Difference between supervised and
unsupervised learning

Supervised: used for prediction; requires labelled input data; the data needs to be
split into train/validation/test sets; used for classification and regression.

Unsupervised: used for analysis; works on unlabelled input data; no split required;
used for clustering, dimensionality reduction and density estimation.

Explain normal distribution of data


Data can be distributed (spread out) in different ways,

It can be spread out more on the left (Left skew)

More on the right (Right Skew)

It can be all jumbled up

But there are many cases where the data tends to be around a central value
with no bias left or right, and it gets close to a "Normal Distribution" like this:

Normal Distribution ("Bell Curve")

The Normal Distribution has:


mean = median = mode
symmetry about the center
50% of values less than the mean and 50% greater than the mean

Reference: https://www.mathsisfun.com/data/standard-normal-distribution.html

What does it mean when a distribution is
left skewed or right skewed?
TODO

What does the distribution look like
for the average time spent watching
YouTube per day?
TODO

Explain covariance and correlation
Covariance and correlation are two mathematical concepts which are
commonly used in the field of probability and statistics. Both concepts
describe the relationship between two variables.

"Covariance" indicates the direction of the linear relationship between
variables. "Correlation", on the other hand, measures both the strength and
direction of the linear relationship between two variables.

In case of High correlation, two sets of data are strongly linked together

Correlation is positive when the values increase together, and
correlation is negative when one value decreases as the other increases.

Roughly: perfect positive correlation = 1, high positive = 0.9, low positive = 0.5,
no correlation = 0, low negative = -0.5, high negative = -0.9, perfect negative = -1.
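A minimal sketch computing both with NumPy on toy data:

# Covariance and correlation with NumPy
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

print(np.cov(x, y)[0, 1])       # covariance: positive => the variables move together
print(np.corrcoef(x, y)[0, 1])  # correlation: close to +1 => strong positive linear relationship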

Reference: https://www.mathsisfun.com

What is regularization? Why is it
useful?
Regularization is the process of adding a tuning parameter (penalty term) to a
model to induce smoothness in order to prevent overfitting.
The tuning parameter controls the excessively fluctuating function in such a
way that the coefficients don't take extreme values.
There are two types of regularization, as follows:
L1 Regularization or Lasso Regularization adds a penalty to the error
function. The penalty is the sum of the absolute values of the weights.
L2 Regularization or Ridge Regularization also adds a penalty to the error
function. But the penalty here is the sum of the squared values of the weights.
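A minimal sketch of both penalties with scikit-learn; the synthetic data and the alpha value are assumptions:

# L1 (Lasso) and L2 (Ridge) regularized linear regression; alpha controls the penalty strength
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=42)

ridge = Ridge(alpha=1.0).fit(X, y)   # penalty = sum of squared weights
lasso = Lasso(alpha=1.0).fit(X, y)   # penalty = sum of absolute weights

print((ridge.coef_ == 0).sum())  # Ridge shrinks but rarely zeroes coefficients
print((lasso.coef_ == 0).sum())  # Lasso often produces exact zeros (implicit feature selection)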

What are confounding variables?
In statistics, a confounder is a variable that influences both the dependent
variable and the independent variable.
If you are researching whether a lack of exercise leads to weight gain, then
'lack of exercise' is the independent variable and 'weight gain' is the
dependent variable. A confounding variable in this case would be 'age', which
affects both of these variables.

Explain ROC curve and AUC


The Receiver Operating Characteristic (ROC) curve is a very useful tool for
evaluating a probabilistic binary classifier.
It is a plot of the false positive rate (x-axis) versus the true positive rate (y-
axis) for a number of different candidate threshold values between 0.0 and
1.0.
False positive rate = FP / (FP + TN)
True positive rate (Sensitivity) = TP / (TP + FN)
AUC (Area Under the ROC Curve) summarizes the curve as a single number: the
probability that the classifier ranks a random positive example above a random
negative one.
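A minimal sketch with scikit-learn on a synthetic binary problem; the dataset and model are assumptions:

# ROC curve and AUC (assumes scikit-learn is installed)
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_scores = clf.predict_proba(X_test)[:, 1]          # predicted probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, y_scores)  # one (FPR, TPR) point per threshold
print("AUC:", roc_auc_score(y_test, y_scores))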

For a more detailed explanation please refer to
https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc

Explain Precision-Recall Curve

What is TF-IDF?
TF-IDF is a statistical measure that evaluates how relevant a word is to a
document in a collection of documents. This is done by multiplying two
metrics: how many times a word appears in a document, and the inverse
document frequency of the word across a set of documents
It is used in information retrieval and text mining
TF-IDF (term frequency-inverse document frequency) was invented for
document search and information retrieval. It works by increasing
proportionally to the number of times a word appears in a document, but is
offset by the number of documents that contain the word. So, words that are
common in every document, such as this, what, and if, rank low even though
they may appear many times, since they don’t mean much to that document
in particular.
However, if the word Bug appears many times in a document, while not
appearing many times in others, it probably means that it’s very relevant.
For example, if what we’re doing is trying to find out which topics some NPS
responses belong to, the word Bug would probably end up being tied to the
topic Reliability, since most responses containing that word would be about
that topic.
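A minimal sketch with scikit-learn's TfidfVectorizer on a toy corpus:

# TF-IDF features for a small corpus (assumes scikit-learn is installed)
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the app crashed with a bug",
    "the bug appears after the update",
    "great design and smooth experience",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)   # sparse matrix: documents x vocabulary

print(vectorizer.get_feature_names_out())  # vocabulary terms
print(tfidf.toarray().round(2))            # high weight for terms frequent in a document but rare overall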
Reference: https://monkeylearn.com/blog/what-is-tf-idf/

Python or R: which one would you
prefer for text analytics?
We would prefer Python for the following reasons:

We can use the pandas library, which has easy-to-use data structures and high-
performance data analysis tools.
R is more suitable for machine learning than for text analytics.
Python is faster for all types of text analytics.

What are Eigenvectors and
Eigenvalues?
Eigenvectors are used for understanding linear transformations.
In data analysis, we usually calculate the eigenvectors of a correlation or
covariance matrix.
Eigenvectors are the directions along which a particular linear transformation
acts by flipping, compressing or stretching.
Eigenvalues can be thought of as the strength of the transformation in the
direction of the corresponding eigenvector, i.e. the factor by which stretching
or compression occurs.
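A minimal sketch with NumPy, computing the eigen decomposition of a covariance matrix; the toy data is an assumption:

# Eigenvalues and eigenvectors of a covariance matrix
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))         # toy data: 100 samples, 3 features

cov = np.cov(X, rowvar=False)         # 3 x 3 covariance matrix (symmetric)
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigh is intended for symmetric matrices

print(eigenvalues)          # variance ("strength") along each eigen direction, ascending order
print(eigenvectors[:, -1])  # eigenvector with the largest eigenvalue, i.e. the direction of most variance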
Refer. https://www.youtube.com/watch?v=glaiP222JWA

Explain false positive and false negative
with examples.
A false positive is where you receive a positive result for a test when you
should have received a negative result. It's sometimes called a "false
alarm" or "false positive error." It's usually used in the medical field, but it
can also apply to other arenas (like software testing).
Some examples of false positives:
A pregnancy test is positive, when in fact you aren’t pregnant.
A cancer screening test comes back positive, but you don’t have the
disease.
A prenatal test comes back positive for Down’s Syndrome, when your
fetus does not have the disorder(1).
Virus software on your computer incorrectly identifies a harmless
program as a malicious one.
False positives can be worrisome, especially when it comes to medical tests.
Researchers are consistently trying to identify reasons for false positives in
order to make tests more sensitive.
A related concept is a false negative, where you receive a negative result
when you should have received a positive one. For example, a pregnancy
test may come back negative even though you are in fact pregnant.
Reference: https://www.statisticshowto.com/false-positive-definition-and-
examples/

Explain a scenario where both false
positives and false negatives are equally
important
In the banking industry, giving loans is the primary source of making money, but
the bank makes a profit only if the repayment rate is good.
A bank always tries not to lose good customers and to avoid bad ones, so in this
case both false positives and false negatives are very important to measure.

Why is feature scaling required in
Gradient Descent Based Algorithms?
Machine learning algorithms like linear regression, logistic regression, neural
network, etc. that use gradient descent as an optimization technique require
data to be scaled. Take a look at the formula for gradient descent below:
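As a generic statement of the update rule (the exact form shown in the referenced article may differ slightly): for a linear model with hypothesis h_θ, learning rate α and m training examples, gradient descent updates each parameter θ_j as

θ_j := θ_j − α · (1/m) · Σ_{i=1..m} ( h_θ(x^(i)) − y^(i) ) · x_j^(i)

The feature value x_j^(i) multiplies the gradient directly, which is why the scale of each feature affects the size of the step taken for its parameter.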

The presence of feature value X in the formula will affect the step size of the
gradient descent.

The difference in ranges of features will cause different step sizes for each
feature.

To ensure that the gradient descent moves smoothly towards the minima
and that the steps for gradient descent are updated at the same rate for all
the features, we scale the data before feeding it to the model.

Having features on a similar scale can help the gradient descent converge
more quickly towards the minima.

Reference: https://www.analyticsvidhya.com/blog/2020/04/feature-scaling-
machine-learning-normalization-standardization/

Why is feature scaling required in
Distance Based Algorithms?
Distance algorithms like KNN, K-means, and SVM are most affected by the
range of features. This is because behind the scenes they are using
distances between data points to determine their similarity.

For example, let’s say we have data containing high school CGPA scores of
students (ranging from 0 to 5) and their future incomes (in thousands
Rupees):

Since the two features have different scales, there is a chance that higher
weightage is given to the feature with the higher magnitude. This will impact the
performance of the machine learning algorithm and, obviously, we do not
want our algorithm to be biased towards one feature.

Therefore, we scale our data before employing a distance-based algorithm so
that all the features contribute equally to the result.

The effect of scaling is conspicuous (clearly visible) when we compare the
Euclidean distance between the data points for students A and B, and between B
and C, before and after scaling, as shown in the sketch below:

Distances AB and BC, before and after scaling (the numeric values are shown in
the referenced article).
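A minimal sketch of the comparison, using illustrative (assumed) CGPA and income values rather than the article's original numbers:

# Euclidean distances before and after scaling
import numpy as np
from sklearn.preprocessing import StandardScaler

# columns: [CGPA, income in thousands of rupees]
students = np.array([
    [3.0, 60.0],   # A
    [3.2, 40.0],   # B
    [3.8, 70.0],   # C
])

def euclid(p, q):
    return np.sqrt(((p - q) ** 2).sum())

print(euclid(students[0], students[1]))  # before scaling: dominated by the income column

scaled = StandardScaler().fit_transform(students)
print(euclid(scaled[0], scaled[1]))      # after scaling: both features contribute comparably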

Scaling has brought both the features into the picture and the distances are
now more comparable than they were before we applied scaling.

Reference: https://www.analyticsvidhya.com/blog/2020/04/feature-scaling-
machine-learning-normalization-standardization/
Why is feature scaling not required in
tree based algorithms?
Tree-based algorithms, on the other hand, are fairly insensitive to the scale
of the features. Think about it, a decision tree is only splitting a node based
on a single feature. The decision tree splits a node on a feature that
increases the homogeneity of the node. This split on a feature is not
influenced by other features.
So, there is virtually no effect of the remaining features on the split. This is
what makes them invariant to the scale of the features!
Reference: https://www.analyticsvidhya.com/blog/2020/04/feature-scaling-
machine-learning-normalization-standardization/

Explain the difference between train,
validation and test set
The training set is used for model training.
The validation set is used for model fine-tuning (e.g. hyperparameter selection).
The test set is used for model testing, i.e. evaluating the model's predictive
power and generalization.

What is Naive Bayes algorithm?


The Naive Bayes classifier assumes that all the features (predictors) are
independent of each other, given the class.
A Naive Bayes model is easy to build and particularly useful for very large data
sets. Along with simplicity, Naive Bayes is known to outperform even highly
sophisticated classification methods.
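A minimal sketch with scikit-learn's GaussianNB; the iris data is an illustrative assumption:

# Gaussian Naive Bayes classifier (assumes scikit-learn is installed)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = GaussianNB().fit(X_train, y_train)  # assumes features are conditionally independent given the class
print(clf.score(X_test, y_test))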
For more details refer: https://www.machinelearningplus.com/predictive-
modeling/how-naive-bayes-algorithm-works-with-example-and-full-code/

What is the difference between MLOps
and DevOps?
MLOps & DevOps have a lot of things in common. However, DevOps include
developing and deploying the software application code in production and
this code is usually static and does not change rapidly.
Loading [MathJax]/jax/output/CommonHTML/fonts/TeX/fontdata.js
MLOps on the other side also includes developing and deploying the ML code
in production. However, here the data changes rapidly and the up-gradation
of models has to happen more frequently than typical software application
code.
Reference: https://360digitmg.com/mlops-interview-questions-answers

What are the risks associated with
Data Science and how can MLOps
overcome them?
Data Science typically has the following issues:
The model goes down without an alert and becomes unavailable.
The model gives incorrect predictions for a given observation that cannot be
scrutinized further.
Model accuracy decreases as time progresses.
Model maintenance also has to be done by data scientists, who are
expensive.
Scaling models across the organization is not easy.
These risks can be addressed by using MLOps.
Reference: https://360digitmg.com/mlops-interview-questions-answers

Explain model/concept drift.
Model drift, sometimes called concept drift, occurs when the model
performance during the inference phase (using real-world data) degrades
when compared to its performance during the training phase (using
historical, labeled data).

It is also known as train/serve skew, as the performance of the model is
skewed between the training and serving phases. This could be
due to many reasons, such as:

The underlying distribution of the data has changed.
Unforeseen events: for example, a model trained on pre-COVID data can be
expected to perform much worse on data from during the COVID-19 pandemic.
Training happened on a limited number of categories, but a recent
environmental change added another category.
In NLP problems, the real-world data contains significantly more tokens
that are different from the training data.
To detect model drift, it is necessary to continuously monitor
the performance of the model.
If there is a sustained degradation of model performance, the cause needs to
be investigated and treatment methods need to be applied accordingly,
which almost always involves model retraining.

Reference: https://360digitmg.com/mlops-interview-questions-answers

Use NLP to read T & C


Reference: https://dataistdogma.github.io/NLP.html

What are the differences between
XGBoost and Random Forest models?
TODO

Please explain p-value to someone
non-technical
This is a really important question because as a data scientist you
are expected to explain lots of technical information to business
users.

What do p-values represent?

Explain with a simple example when it is ok to use a p-value.

TODO

Average comments per month has
dropped over a three-month period,
despite consistent growth after a new
launch. What metric would you
investigate?
TODO

A PM tells you that a weekly active
user metric is up by 5% but the email
notification open rate is down by 2%.
What would you investigate to diagnose
this problem?
Email open rate is calculated by dividing the number of emails opened by the
number of emails sent minus any bounces. A good open rate is between 17%
and 28%. Email notification open rate is a type of email open rate that measures
how many users open an email that notifies them about something.

Weekly active user metric (WAU) is a measure of how many users are active on a
website or app in a given week. It can be influenced by many factors, such as
user acquisition, retention, engagement and churn.

To diagnose the problem of WAU being up but email notification open rate being
down, you might want to investigate:

How are you defining active users? Are they performing meaningful actions on
your website or app that indicate engagement and loyalty?
How are you segmenting your users based on their behavior, preferences and
needs? Are you sending relevant and personalized email notifications to each
segment?
How are you optimizing your email subject lines, preheaders, sender names and
content to capture attention and interest? Are you using clear and compelling
calls to action?
How are you testing and measuring your email performance? Are you
using tools like A/B testing, analytics and feedback surveys to improve your
email strategy?

References
https://www.youtube.com/watch?v=k6QWYwOvJs0&t=1149s
https://towardsdatascience.com/taking-the-confusion-out-of-confusion-
matrices-c1ce054b3d3e
https://kambria.io/blog/confused-about-the-confusion-matrix-learn-all-about-
it/#:~:text=Confusion%20matrices%20are%20used%20to,True%20Negatives%2
https://projects.uplevel.work/insights/confusion-matrix-accuracy-sensitivity-
specificity-precision-f1-score-how-to-interpret
https://www.analyticsvidhya.com/blog/2020/04/confusion-matrix-machine-
learning/
https://towardsdatascience.com/imbalanced-classification-in-python-smote-
enn-method-
db5db06b8d50#:~:text=The%20Concept%3A%20Edited%20Nearest%20Neighb
https://www.youtube.com/watch?v=Aarb0_Cw_48&ab_channel=JayFeng
