Data Science Interview Questions
A data set used for performance evaluation is called a test data set. It should
contain the correct (observed) labels alongside the predicted labels.
If the performance of a binary classifier is perfect, the predicted labels will be
exactly the same as the observed labels.
In real-world scenarios, the predicted labels usually match only part of the
observed labels.
A binary classifier predicts each data instance of a test dataset as either positive or
negative. This produces four outcomes:
1. True positive (TP) — Correct positive prediction
2. False positive (FP) — Incorrect positive prediction
3. True negative (TN) — Correct negative prediction
4. False negative (FN) — Incorrect negative prediction
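As a rough illustration (the labels below are made up), these four counts can be computed directly from the observed and predicted label arrays, for example with NumPy:

import numpy as np

# Hypothetical observed (true) and predicted labels for a binary classifier
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))  # correct positive predictions
fp = np.sum((y_true == 0) & (y_pred == 1))  # incorrect positive predictions
tn = np.sum((y_true == 0) & (y_pred == 0))  # correct negative predictions
fn = np.sum((y_true == 1) & (y_pred == 0))  # incorrect negative predictions

print(tp, fp, tn, fn)  # 3 1 3 1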
In the above diagram we see that the thinner lines mark the distance from the
classifier to the closest data points, which are called the support vectors (the darkened
data points). The distance between the two thin lines is called the margin.
10. What are the different kernel functions in SVM?
There are four types of kernels in SVM.
1. Linear Kernel
2. Polynomial kernel
3. Radial basis kernel
4. Sigmoid kernel
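As a sketch of how these look in practice, assuming scikit-learn's SVC (whose kernel parameter accepts 'linear', 'poly', 'rbf', and 'sigmoid') and a synthetic dataset:

from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Toy data; in practice you would use your own feature matrix and labels
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# The four kernel types map directly to SVC's `kernel` parameter
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    clf = SVC(kernel=kernel)
    score = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{kernel:8s} mean CV accuracy: {score:.3f}")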
11. Explain Decision Tree algorithm in detail.
A decision tree is a supervised machine learning algorithm mainly used for
regression and classification. It breaks down a data set into smaller and
smaller subsets while an associated decision tree is incrementally developed.
The final result is a tree with decision nodes and leaf nodes. Decision trees can
handle both categorical and numerical data.
Information Gain
The Information Gain is based on the decrease in entropy after a dataset is split
on an attribute. Constructing a decision tree is all about finding the attributes that
return the highest information gain.
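A minimal sketch of the calculation, using hypothetical helper functions (entropy, information_gain) and a made-up split of 10 labels into two child nodes:

import numpy as np

def entropy(labels):
    # Shannon entropy of a list of class labels
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent_labels, child_splits):
    # Entropy of the parent minus the weighted entropy of the children
    n = len(parent_labels)
    weighted_child_entropy = sum(
        len(child) / n * entropy(child) for child in child_splits
    )
    return entropy(parent_labels) - weighted_child_entropy

parent = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
split = ([1, 1, 1, 1, 0], [1, 0, 0, 0, 0])   # two child nodes after the split
print(information_gain(parent, split))       # ~0.278 bits

The attribute whose split yields the largest value of this quantity is chosen at each node.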
Data is usually distributed in different ways, with a bias to the left or to the right,
or it can be all jumbled up. However, there are cases where data is distributed
around a central value without any bias to the left or right, approaching a normal
distribution in the form of a bell-shaped curve. The random variables are then
distributed in the form of a symmetrical bell-shaped curve.
19. What is a Box Cox Transformation?
The dependent variable in a regression analysis might not satisfy one or more
assumptions of ordinary least squares regression. The residuals could either
curve as the prediction increases or follow a skewed distribution. In such
scenarios, it is necessary to transform the response variable so that the data
meets the required assumptions. A Box-Cox transformation is a statistical
technique to transform a non-normal dependent variable into a normal shape.
Most statistical techniques assume normality, so if the given data is not normal,
applying a Box-Cox transformation means that you can run a broader
number of tests.
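As a rough sketch, assuming SciPy's stats.boxcox and some synthetic right-skewed data (Box-Cox requires strictly positive values):

import numpy as np
from scipy import stats

# Skewed, strictly positive sample data
rng = np.random.default_rng(0)
y = rng.exponential(scale=2.0, size=1000)

# boxcox returns the transformed values and the fitted lambda
y_transformed, fitted_lambda = stats.boxcox(y)

print("skew before:", stats.skew(y).round(2))
print("skew after :", stats.skew(y_transformed).round(2))
print("lambda     :", round(fitted_lambda, 3))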
The red-circled point in the elbow plot above, i.e. number of clusters = 6, is the point after
which you don't see any significant decrease in WSS (within-cluster sum of squares).
This point is known as the bending (elbow) point and is taken as K in K-Means.
This is the most widely used approach, but some data scientists also use
hierarchical clustering first to create dendrograms and identify
the distinct groups from there.
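A minimal sketch of the elbow approach, assuming scikit-learn's KMeans (whose inertia_ attribute is the WSS) and synthetic blob data:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with a known number of clusters
X, _ = make_blobs(n_samples=500, centers=6, random_state=42)

# Within-cluster sum of squares (WSS) for k = 1..10; look for the "bend"
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(k, round(km.inertia_, 1))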
21. What is deep learning?
Deep learning is a subfield of machine learning inspired by the structure and function
of the brain, via artificial neural networks. Machine learning includes many algorithms
such as linear regression, SVM, and neural networks, and
deep learning is essentially an extension of neural networks. In ordinary neural nets we use a
small number of hidden layers, but deep learning algorithms
use a large number of hidden layers to better model the input-output
relationship.
22. What are Recurrent Neural Networks(RNNs) ?
Recurrent nets are a type of artificial neural network designed to recognise
patterns in sequences of data, such as time series data from stock markets,
sensors, and government agencies. To understand recurrent nets, you first have to
understand the basics of feed-forward nets. Both RNNs and feed-forward
networks are named after the way they channel information through a series of
mathematical operations performed at the nodes of the network. One feeds
information straight through (never touching the same node twice), while the other
cycles it through a loop; the latter are called recurrent.
Recurrent networks, on the other hand, take as their input not just the current
input example they see, but also what they have perceived previously in time.
The BTSXPE at the bottom of the drawing represents the input example in the
current moment, and CONTEXT UNIT represents the output of the previous
moment. The decision a recurrent neural network reached at time t-1 affects the
decision that it will reach one moment later at time t. So recurrent networks have
two sources of input, the present and the recent past, which combine to
determine how they respond to new data, much as we do in life.
The error they generate will return via back propagation and be used to adjust
their weights until error can’t go any lower. Remember, the purpose of recurrent
nets is to accurately classify sequential input. We rely on the back propagation of
error and gradient descent to do so.
Back propagation in feed forward networks moves backward from the final error
through the outputs, weights and inputs of each hidden layer, assigning those
weights responsibility for a portion of the error by calculating their partial
derivatives — ∂E/∂w, or the relationship between their rates of change. Those
derivatives are then used by our learning rule, gradient descent, to adjust the
weights up or down, whichever direction decreases error.
Recurrent networks rely on an extension of back propagation called back
propagation through time, or BPTT. Time, in this case, is simply expressed by a
well-defined, ordered series of calculations linking one time step to the next,
which is all back propagation needs to work.
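To make the "two sources of input" idea concrete, here is a minimal, illustrative forward pass of a single recurrent cell in NumPy (the weights and the input sequence are random placeholders); BPTT would then propagate the error back through these same time steps:

import numpy as np

# Dimensions for a toy recurrent cell
n_in, n_hidden = 3, 4
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(n_hidden, n_in))      # input -> hidden weights
W_hh = rng.normal(size=(n_hidden, n_hidden))  # hidden -> hidden ("context") weights
b_h = np.zeros(n_hidden)

def rnn_step(x_t, h_prev):
    # One time step: combine the current input with the previous hidden state
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# A short sequence of 5 input vectors
sequence = rng.normal(size=(5, n_in))
h = np.zeros(n_hidden)               # initial context
for t, x_t in enumerate(sequence):
    h = rnn_step(x_t, h)             # the state at t depends on t-1 via h
    print(t, h.round(3))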
What does 'Naive' mean in Naive Bayes?
The algorithm is 'naive' because it assumes that the features are independent of
one another, an assumption that may or may not turn out to be correct.
33. Why do we generally use the Softmax non-linearity function as the last operation
in a network?
It is because it takes in a vector of real numbers and returns a probability
distribution. Its definition is as follows. Let x be a vector of real numbers
(positive, negative, whatever, there are no constraints). Then the i-th component
of Softmax(x) is:
Softmax(x)_i = exp(x_i) / Σ_j exp(x_j)
Each component is positive and the components sum to 1, so the output can be
interpreted as a probability distribution over the classes.
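A small, illustrative NumPy implementation (shifting by the maximum before exponentiating, a common but optional trick for numerical stability):

import numpy as np

def softmax(x):
    # Numerically stable softmax: shift by the max before exponentiating
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

scores = np.array([2.0, 1.0, -3.0])
probs = softmax(scores)
print(probs, probs.sum())  # components are positive and sum to 1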
Boxplot vs Histogram
While boxplots and histograms are visualizations used to show the distribution
of the data, they communicate information differently.
Histograms are bar charts that show the frequency of a numerical variable’s
values and are used to approximate the probability distribution of the given
variable. A histogram allows you to quickly understand the shape of the distribution, the
variation, and potential outliers.
Boxplots communicate different aspects of the distribution of data. While you
can’t see the shape of the distribution through a box plot, you can gather other
information like the quartiles, the range, and outliers. Boxplots are especially
useful when you want to compare multiple charts at the same time because they
take up less space than histograms.
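As an illustrative sketch, assuming Matplotlib and a made-up skewed sample, the two views can be drawn side by side:

import numpy as np
import matplotlib.pyplot as plt

# Synthetic data with a few extreme values
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(50, 10, 500), rng.normal(120, 5, 10)])

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(data, bins=30)        # shows the shape of the distribution
ax1.set_title("Histogram")
ax2.boxplot(data)              # shows quartiles, range, and outliers compactly
ax2.set_title("Boxplot")
plt.tight_layout()
plt.show()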
Unlike supervised learning, unsupervised learning is used to draw inferences
and find patterns from input data without references to labeled outcomes. A
common use of unsupervised learning is grouping customers by purchasing
behavior to find target markets.
Check out my article ‘All Machine Learning Models Explained in Six Minutes’ if
you’d like to learn more about this!
Q: Assume you need to generate a predictive model using multiple
regression. Explain how you intend to validate this model
There are two main ways that you can do this:
A) Adjusted R-squared.
R-squared is a measurement that tells you the proportion of variance in the
dependent variable that is explained by the variance in the
independent variables. In simpler terms, while the coefficients estimate trends,
R-squared represents the scatter around the line of best fit.
However, every additional independent variable added to a
model always increases the R-squared value; therefore, a model with several
independent variables may seem to be a better fit even if it isn’t. This is where
adjusted R² comes in. The adjusted R² compensates for each additional
independent variable and only increases if a given variable improves the
model more than would be expected by chance. This is important since we are
creating a multiple regression model.
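As a quick illustration of the formula (adjusted R² = 1 - (1 - R²)(n - 1)/(n - k - 1), with n observations and k predictors), a tiny helper might look like this:

def adjusted_r2(r2, n, k):
    # Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)
    # n = number of observations, k = number of independent variables
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Adding predictors raises R^2 slightly, but adjusted R^2 can still fall
print(adjusted_r2(r2=0.80, n=100, k=3))   # ~0.794
print(adjusted_r2(r2=0.81, n=100, k=10))  # ~0.789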
B) Cross-Validation
A method common to most people is cross-validation: splitting the data into two
sets, training and testing data. See the answer to the first question for more on this.
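A minimal hold-out validation sketch, assuming scikit-learn and synthetic regression data:

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

X, y = make_regression(n_samples=300, n_features=5, noise=10, random_state=0)

# Hold out a test set, fit on the training set, evaluate on unseen data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print("held-out R^2:", round(r2_score(y_test, model.predict(X_test)), 3))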
Q: What does NLP stand for?
NLP stands for Natural Language Processing. It is a branch of artificial
intelligence that gives machines the ability to read and understand human
languages.
Q: When would you use random forests Vs SVM and why?
There are a couple of reasons why a random forest is a better choice of model
than a support vector machine:
• Random forests allow you to determine the feature importance. SVMs
can’t do this.
• Random forests are much quicker and simpler to build than an SVM.
• For multi-class classification problems, SVMs require a one-vs-rest
method, which is less scalable and more memory intensive.
Q: Why is dimension reduction important?
Dimensionality reduction is the process of reducing the number of features in a
dataset. This is important mainly in the case when you want to reduce variance
in your model (overfitting).
Wikipedia states four advantages of dimensionality reduction (see here):
1. It reduces the time and storage space required
2. Removal of multi-collinearity improves the interpretation of the
parameters of the machine learning model
3. It becomes easier to visualize the data when reduced to very low
dimensions such as 2D or 3D
4. It avoids the curse of dimensionality
Q: What is principal component analysis? Explain the sort of
problems you would use PCA for.
In its simplest sense, PCA involves projecting higher-dimensional data (e.g. 3
dimensions) down to a smaller space (e.g. 2 dimensions). This results in a lower
dimension of data (2 dimensions instead of 3 dimensions) while keeping all of the
original variables in the model, since each principal component is a linear
combination of them.
PCA is commonly used for compression purposes, to reduce required memory
and to speed up the algorithm, as well as for visualization purposes, making it
easier to summarize data.
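A small sketch using scikit-learn's PCA on the built-in iris data (standardizing first, which is common practice but an assumption of this example):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data                       # 4-dimensional data
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)                  # project down to 2 dimensions
X_2d = pca.fit_transform(X_scaled)

print(X_2d.shape)                          # (150, 2)
print(pca.explained_variance_ratio_)       # variance kept by each component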
Q: Why is Naive Bayes so bad? How would you improve a spam
detection algorithm that uses naive Bayes?
One major drawback of Naive Bayes is its strong assumption that
the features are independent of one another, which is rarely, if ever,
the case in practice.
One way to improve such an algorithm that uses Naive Bayes is by decorrelating
the features so that the assumption holds true.
Q: What are the drawbacks of a linear model?
There are a couple of drawbacks of a linear model:
• A linear model holds some strong assumptions that may not be true in
application. It assumes a linear relationship, multivariate normality,
no or little multicollinearity, no auto-correlation, and
homoscedasticity
• A linear model can’t be used for discrete or binary outcomes.
• You can’t vary the model flexibility of a linear model.
Q: Do you think 50 small decision trees are better than a large
one? Why?
Another way of asking this question is “Is a random forest a better model than a
decision tree?” And the answer is yes because a random forest is an ensemble
method that takes many weak decision trees to make a strong learner. Random
forests are more accurate, more robust, and less prone to overfitting.
Q: Why is mean square error a bad measure of model
performance? What would you suggest instead?
Mean Squared Error (MSE) gives a relatively high weight to large errors;
therefore, MSE tends to put too much emphasis on large deviations. A more
robust alternative is MAE (mean absolute error).
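A tiny made-up example showing how a single large error dominates MSE but not MAE:

import numpy as np

y_true = np.array([10.0, 12.0, 11.0, 13.0, 12.0])
y_pred = np.array([11.0, 11.0, 12.0, 12.0, 40.0])  # one large error

errors = y_pred - y_true
mse = np.mean(errors ** 2)
mae = np.mean(np.abs(errors))

print("MSE:", mse)   # 157.6 -- dominated by the single large deviation
print("MAE:", mae)   # 6.4  -- grows only linearly with the large error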
Q: What are the assumptions required for linear regression? What
if some of these assumptions are violated?
The assumptions are as follows:
1. The sample data used to fit the model is representative of the
population
2. The relationship between X and the mean of Y is linear
3. The variance of the residual is the same for any value of
X (homoscedasticity)
4. Observations are independent of each other
5. For any value of X, Y is normally distributed.
Extreme violations of these assumptions will make the results unreliable. Minor
violations of these assumptions will result in a greater bias or variance of the
estimates.
Q: What is collinearity and what to do with it? How to remove
multicollinearity?
Multicollinearity exists when an independent variable is highly correlated with
another independent variable in a multiple regression equation. This can be
problematic because it undermines the statistical significance of an independent
variable.
You could use the Variance Inflation Factors (VIF) to determine if there is any
multicollinearity between independent variables — a standard benchmark is
that if the VIF is greater than 5 then multicollinearity exists.
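As a sketch, assuming statsmodels' variance_inflation_factor and a made-up feature matrix in which x3 is nearly a linear combination of x1 and x2:

import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical features; x3 is almost determined by x1 and x2
rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
df["x3"] = 0.7 * df["x1"] + 0.3 * df["x2"] + rng.normal(scale=0.05, size=200)

X = df.values  # note: in practice you would often add an intercept column first
for i, col in enumerate(df.columns):
    print(col, round(variance_inflation_factor(X, i), 1))  # VIF > 5 flags collinearity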
Q: How to check if the regression model fits the data well?
There are a couple of metrics that you can use:
R-squared/Adjusted R-squared: a relative measure of fit. This was explained in a
previous answer.
F-test (F-statistic): evaluates the null hypothesis that all regression coefficients are equal
to zero versus the alternative hypothesis that at least one is not.
RMSE: an absolute measure of fit.
Q: What is a decision tree?
For example, if we relied on a single decision tree (say the third one), it would predict 0. But
if we relied on the mode of all 4 decision trees, the predicted value would be 1.
This is the power of random forests.
Random forests offer several other benefits, including strong performance, the ability to
model non-linear boundaries, no need for cross-validation, and built-in feature
importance.
Q: What is a kernel? Explain the kernel trick
A kernel is a way of computing the dot product of two vectors x and y in some
(possibly very high-dimensional) feature space, which is why kernel functions
are sometimes called a “generalized dot product” [2].
The kernel trick is a method of using a linear classifier to solve a non-linear
problem by transforming linearly inseparable data to linearly separable ones in a
higher dimension.
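One way to see the trick concretely: for 2-dimensional inputs, the degree-2 polynomial kernel (x·y)² equals an ordinary dot product in a 3-dimensional feature space. A small NumPy check, with arbitrary example vectors and a hypothetical explicit feature map phi:

import numpy as np

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])

# Degree-2 polynomial kernel: K(x, y) = (x . y)^2
kernel_value = np.dot(x, y) ** 2

# Explicit feature map phi(v) = (v1^2, sqrt(2)*v1*v2, v2^2) for 2-d inputs
def phi(v):
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

explicit_value = np.dot(phi(x), phi(y))

print(kernel_value, explicit_value)  # identical: 16.0 16.0

The kernel computes the same quantity without ever constructing the higher-dimensional features, which is what makes the trick cheap.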
Q: Is it beneficial to perform dimensionality reduction before
fitting an SVM? Why or why not?
When the number of features is greater than the number of observations, then
performing dimensionality reduction will generally improve the SVM.
Q: What is overfitting?
Overfitting occurs when a model fits the training data too closely, capturing its
noise as well as its signal, so it performs well on the training set but generalizes
poorly to unseen data.
Q: Out of 100 coins, one is unfair and always lands on heads. You pick a
coin at random, flip it 10 times, and get 10 heads. What is the probability
that you picked the unfair coin?
Assume that the probability of picking the unfair coin is denoted as P(A) and the
probability of flipping 10 heads in a row is denoted as P(B). Then P(B|A) is equal
to 1, P(B|¬A) is equal to 0.5¹⁰, and P(¬A) is equal to 0.99.
If you fill in the equation, then P(A|B) = 0.9118 or 91.18%.
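Plugging the probabilities quoted above into Bayes' rule (assuming the setup of one unfair, always-heads coin out of 100, i.e. P(A) = 0.01):

# Bayes' rule with the values quoted above
p_a = 0.01                    # prior probability of picking the unfair coin
p_not_a = 0.99
p_b_given_a = 1.0             # 10 heads is certain with the unfair coin
p_b_given_not_a = 0.5 ** 10   # 10 heads with a fair coin

p_a_given_b = (p_b_given_a * p_a) / (p_b_given_a * p_a + p_b_given_not_a * p_not_a)
print(round(p_a_given_b, 4))  # 0.9118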
Q: Difference between convex and non-convex cost function; what
does it mean when a cost function is non-convex?
A convex cost function has a single global minimum and no other local minima, so
gradient-based optimization will converge toward that global minimum. A
non-convex cost function has multiple local minima (and possibly saddle points),
which means an optimizer such as gradient descent can get stuck in a solution
that is only locally, not globally, optimal.
Q: What is the Central Limit Theorem and why is it important?
Statistics How To provides the best definition of CLT, which is:
“The central limit theorem states that the sampling distribution of the sample
mean approaches a normal distribution as the sample size gets larger no matter
what the shape of the population distribution.” [1]
The central limit theorem is important because it is used in hypothesis testing
and also to calculate confidence intervals.
Q: What is the statistical power?
‘Statistical power’ refers to the power of a binary hypothesis test: the
probability that the test rejects the null hypothesis given that the alternative
hypothesis is true.
Q: Explain selection bias (with regard to a dataset, not variable
selection). Why is it important? How can data management
procedures such as missing data handling make it worse?
Selection bias is the phenomenon of selecting individuals, groups or data for
analysis in such a way that proper randomization is not achieved, ultimately
resulting in a sample that is not representative of the population.
Understanding and identifying selection bias is important because it can
significantly skew results and provide false insights about a particular
population group.
Types of selection bias include:
• sampling bias: a biased sample caused by non-random sampling
• time interval: selecting a specific time frame that supports the
desired conclusion. e.g. conducting a sales analysis near Christmas.
• exposure: includes clinical susceptibility bias, protopathic bias,
indication bias. Read more here.
• data: includes cherry-picking, suppressing evidence, and the fallacy of
incomplete evidence.
• attrition: attrition bias is similar to survivorship bias, where only
those that ‘survived’ a long process are included in an analysis, or
failure bias, where only those that ‘failed’ are included
• observer selection: related to the Anthropic principle, which is a
philosophical consideration that any data we collect about the
universe is filtered by the fact that, in order for it to be observable, it
must be compatible with the conscious and sapient life that observes
it. [3]
Handling missing data can make selection bias worse because different methods
impact the data in different ways. For example, if you replace null values with the
mean of the data, you are adding bias in the sense that you’re assuming that the data
is not as spread out as it might actually be.
Q: Provide a simple example of how an experimental design can
help answer a question about behavior. How does experimental
data contrast with observational data?
Observational data comes from observational studies which are when you
observe certain variables and try to determine if there is any correlation.
Experimental data comes from experimental studies which are when you
control certain variables and hold them constant to determine if there is any
causality.
An example of experimental design is the following: split a group up into two.
The control group lives their lives normally. The test group is told to drink a glass
of wine every night for 30 days. Then research can be conducted to see how wine
affects sleep.
Q: Is mean imputation of missing data acceptable practice? Why or
why not?
Mean imputation is the practice of replacing null values in a data set with the
mean of the data.
Mean imputation is generally bad practice because it doesn’t take into account
feature correlation. For example, imagine we have a table showing age and
fitness score, and imagine that an eighty-year-old has a missing fitness score. If
we took the average fitness score over an age range of 15 to 80, then the eighty-
year-old will appear to have a much higher fitness score than he actually should.
Second, mean imputation reduces the variance of the data and increases bias in
our data. This leads to a less accurate model and a narrower confidence interval
due to a smaller variance.
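A tiny made-up example of both effects (the implausibly high imputed value and the shrinking variance), using pandas:

import numpy as np
import pandas as pd

ages = pd.Series([15, 25, 40, 60, 80], dtype=float)
fitness = pd.Series([90, 80, 65, 45, np.nan])   # the 80-year-old's score is missing

imputed = fitness.fillna(fitness.mean())        # global mean imputation

print("imputed value:", imputed.iloc[-1])            # 70.0 -- implausibly high for age 80
print("variance before:", round(fitness.var(), 1))
print("variance after :", round(imputed.var(), 1))   # variance shrinks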
Q: What is an outlier? Explain how you might screen for outliers
and what would you do if you found them in your dataset. Also,
explain what an inlier is and how you might screen for them and
what would you do if you found them in your dataset.
An outlier is a data point that differs significantly from other observations.
Depending on the cause of the outlier, they can be bad from a machine learning
perspective because they can worsen the accuracy of a model. If the outlier is
caused by a measurement error, it’s important to remove them from the dataset.
There are a couple of ways to identify outliers:
Z-score/standard deviations: if we know that 99.7% of data in a data set lie
within three standard deviations, then we can calculate the size of one standard
deviation, multiply it by 3, and identify the data points that are outside of this
range. Likewise, we can calculate the z-score of a given point, and if it’s equal to
+/- 3, then it’s an outlier.
Note that there are a few contingencies that need to be considered when using
this method: the data must be normally distributed, it is not applicable for
small data sets, and the presence of too many outliers can throw off the z-scores.
Interquartile Range (IQR): The IQR, the concept used to build boxplots, can also be
used to identify outliers. The IQR is equal to the difference between the 3rd
quartile and the 1st quartile. You can then identify a point as an outlier if it is
less than Q1 - 1.5*IQR or greater than Q3 + 1.5*IQR. This comes to approximately
2.698 standard deviations.
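An illustrative sketch of both screens, run on synthetic data with two injected outliers:

import numpy as np

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(50, 5, 200), [95.0, 4.0]])  # two injected outliers

# Z-score rule: flag points more than 3 standard deviations from the mean
z = (data - data.mean()) / data.std()
print("z-score outliers:", data[np.abs(z) > 3])

# IQR rule: flag points below Q1 - 1.5*IQR or above Q3 + 1.5*IQR
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
mask = (data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)
print("IQR outliers:", data[mask])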
Write a SQL query to get the second-highest salary from the Employee table below. For this example, the query should return 200.
+----+--------+
| Id | Salary |
+----+--------+
| 1  | 100    |
| 2  | 200    |
| 3  | 300    |
+----+--------+
SOLUTION A: Using IFNULL, OFFSET
• IFNULL(expression, alt) : ifnull() returns the expression if it is not null,
otherwise it returns the alternative value. We’ll use this to return null if
there’s no second-highest salary.
• OFFSET : offset is used with the ORDER BY clause to disregard the top
n rows that you specify. This will be useful as you’ll want to get the
second row (2nd highest salary)
SELECT
    IFNULL(
      (SELECT DISTINCT Salary
       FROM Employee
       ORDER BY Salary DESC
       LIMIT 1 OFFSET 1
      ), null) as SecondHighestSalary
FROM Employee
LIMIT 1
SOLUTION B: Using MAX()
This query says to choose the MAX salary that isn’t equal to the MAX salary,
which is equivalent to saying to choose the second-highest salary!
SELECT MAX(Salary) AS SecondHighestSalary
FROM Employee
WHERE Salary != (SELECT MAX(Salary) FROM Employee)
Write a SQL query to find all duplicate emails in the Person table below.
+----+---------+
| Id | Email   |
+----+---------+
| 1  | a@b.com |
| 2  | c@d.com |
| 3  | a@b.com |
+----+---------+
SOLUTION A: COUNT() in a Subquery
First, a subquery is created to show the count of the frequency of each email.
Then the subquery is filtered WHERE the count is greater than 1.
SELECT Email
FROM (
    SELECT Email, COUNT(Email) AS num
    FROM Person
    GROUP BY Email
) as email_count
WHERE num > 1
Write a SQL query to find all dates' Ids where the temperature was higher than it was on the previous date, given the Weather table below.
+---------+------------------+------------------+
| Id      | RecordDate       | Temperature      |
+---------+------------------+------------------+
| 1       | 2015-01-01       | 10               |
| 2       | 2015-01-02       | 25               |
| 3       | 2015-01-03       | 20               |
| 4       | 2015-01-04       | 30               |
+---------+------------------+------------------+
SOLUTION: DATEDIFF()
• DATEDIFF calculates the difference between two dates and is used to
make sure we’re comparing today’s temperature to yesterday’s
temperature.
In plain English, the query is saying, Select the Ids where the temperature on a
given day is greater than the temperature yesterday.
SELECT DISTINCT a.Id
FROM Weather a, Weather b
WHERE a.Temperature > b.Temperature
AND DATEDIFF(a.RecordDate, b.RecordDate) = 1
The Employee table holds all employees of the company; every employee has an Id, a Name, a Salary, and a DepartmentId.
+----+-------+--------+--------------+
| Id | Name  | Salary | DepartmentId |
+----+-------+--------+--------------+
| 1  | Joe   | 70000  | 1            |
| 2  | Jim   | 90000  | 1            |
| 3  | Henry | 80000  | 2            |
| 4  | Sam   | 60000  | 2            |
| 5  | Max   | 90000  | 1            |
+----+-------+--------+--------------+
The Department table holds all departments of the company.
+----+----------+
| Id | Name |
+----+----------+
| 1 | IT |
| 2 | Sales |
+----+----------+
Write a SQL query to find employees who have the highest salary in each of the
departments. For the above tables, your SQL query should return the following
rows (order of rows does not matter).
+------------+----------+--------+
| Department | Employee | Salary |
+------------+----------+--------+
| IT         | Max      | 90000  |
| IT         | Jim      | 90000  |
| Sales      | Henry    | 80000  |
+------------+----------+--------+
SOLUTION: IN Clause
• The IN clause allows you to use multiple OR clauses in a WHERE
statement. For example WHERE country = ‘Canada’ or country = ‘USA’
is the same as WHERE country IN (‘Canada’, ’USA’).
• In this case, we want to filter the Department table to only show the
highest Salary per Department (i.e. DepartmentId). Then we can join
the two tables WHERE the DepartmentId and Salary is in the filtered
Department table.
SELECT
    Department.name AS 'Department',
    Employee.name AS 'Employee',
    Salary
FROM Employee
JOIN Department ON Employee.DepartmentId = Department.Id
WHERE (Employee.DepartmentId, Salary) IN
    (   SELECT
            DepartmentId, MAX(Salary)
        FROM
            Employee
        GROUP BY DepartmentId
    )
PROBLEM #5: Exchange Seats
Mary is a teacher in a middle school and she has a table seat storing students' names
and their corresponding seat ids. The column id is a continuous increment. Mary
wants to change seats for the adjacent students.
Can you write a SQL query to output the result for Mary?
+---------+---------+
| id | student |
+---------+---------+
| 1 | Abbot |
| 2 | Doris |
| 3 | Emerson |
| 4 | Green |
| 5 | Jeames |
+---------+---------+
For the sample input, the output is:
+---------+---------+
| id | student |
+---------+---------+
| 1 | Doris |
| 2 | Abbot |
| 3 | Green |
| 4 | Emerson |
| 5 | Jeames |
+---------+---------+
Note:
If the number of students is odd, there is no need to change the last one’s seat.
SOLUTION: CASE WHEN
• Think of a CASE WHEN THEN statement like an IF statement in
coding.
• The first WHEN statement checks to see if there’s an odd number of
rows, and if there is, ensure that the id number does not change.
• The second WHEN statement adds 1 to each id (eg. 1,3,5 becomes
2,4,6)
• Similarly, the third WHEN statement subtracts 1 to each id (2,4,6
becomes 1,3,5)
SELECT
    CASE
        WHEN ((SELECT MAX(id) FROM seat) % 2 = 1) AND id = (SELECT MAX(id) FROM seat) THEN id
        WHEN id % 2 = 1 THEN id + 1
        ELSE id - 1
    END AS id, student
FROM seat
ORDER BY id
Miscellaneous
Q: If there are 8 marbles of equal weight and 1 marble that weighs
a little bit more (for a total of 9 marbles), how many weighings are
required to determine which marble is the heaviest?