Data Science Interview Questions
Q4. How R is Useful in the Data Science Domain?
Q5. What is Supervised Learning?
Q6. What is Unsupervised Learning?
Q7. What do you understand about Linear Regression?
Q8. What do you understand by logistic regression?
Q9. What is a confusion matrix?
Q10. What do you understand about the true-positive rate and false-positive rate?
Following are the three categories into which these Data Science interview questions are divided:
1. Basic Level
2. Intermediate Level
3. Advanced Level
Data Analytics is a subset of Data Science, whereas Data Science is a broad technology that includes various subsets such as Data Analytics, Data Mining, Data Visualization, etc.
The goal of data analytics is to illustrate the precise details of retrieved insights, while the goal of data science is to discover meaningful insights from massive datasets and derive the best possible solutions to resolve business issues.
Data Analytics focuses on just finding the solutions, whereas Data Science not only focuses on finding solutions but also predicts the future with past patterns or insights.
A data analyst’s job is to analyze data in order to make decisions, while a data scientist’s job is to provide insightful data visualizations from raw data that are easily understandable.
Its syntax is meticulously designed to be intuitive and concise, enabling ease in coding, comprehension,
and maintenance. Additionally, Python offers a comprehensive standard library that encompasses a
diverse collection of pre-built modules and functions. This wealth of resources substantially minimizes
the time and effort expended by developers, streamlining the execution of routine programming tasks.
Data Manipulation and Analysis: R offers a comprehensive collection of libraries and functions that
facilitate proficient data manipulation, transformation, and statistical analysis.
Statistical Modeling and Machine Learning: R offers a wide range of packages for advanced
statistical modeling and machine learning tasks, empowering data scientists to build predictive
models and perform complex analyses.
Data Visualization: R’s extensive visualization libraries enable the creation of visually appealing and
insightful plots, charts, and graphs.
Reproducible Research: R supports the integration of code, data, and documentation, facilitating
reproducible workflows and ensuring transparency in data science projects.
Temperature and humidity are the independent variables, and rain would be our dependent variable. The logistic regression algorithm produces an S-shaped (sigmoid) curve, y = 1 / (1 + e^-(b0 + b1*x)), so the predicted value of Y always lies within the range of 0 to 1. This is how logistic regression works.
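To make this concrete, here is a minimal sketch (using scikit-learn on made-up temperature/humidity/rain values, not data from the article) showing that the fitted model outputs probabilities between 0 and 1:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical observations: [temperature, humidity] and whether it rained (1) or not (0)
X = np.array([[30, 80], [25, 60], [35, 90], [20, 40], [28, 75], [22, 50]])
y = np.array([1, 0, 1, 0, 1, 0])

model = LogisticRegression(max_iter=1000).fit(X, y)
print(model.predict_proba([[27, 70]]))  # probabilities of 'no rain' and 'rain', summing to 1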
10. What do you understand about the true-positive rate and false-
positive rate?
True positive rate: In Machine Learning, the true-positive rate, also referred to as sensitivity or recall, measures the percentage of actual positives that are correctly identified. Formula: TPR = TP / (TP + FN).
False positive rate: The false-positive rate is the probability of falsely rejecting the null hypothesis for a particular test. It is calculated as the ratio between the number of negative events wrongly categorized as positive (false positives) and the total number of actual negative events. Formula: FPR = FP / (FP + TN).
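As an illustration (with hypothetical labels, using scikit-learn), both rates can be computed directly from a confusion matrix:

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # actual labels (hypothetical)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # predicted labels (hypothetical)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
tpr = tp / (tp + fn)   # true-positive rate (sensitivity/recall)
fpr = fp / (fp + tn)   # false-positive rate
print(tpr, fpr)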
In traditional programming paradigms, we used to analyze the input, figure out the expected output,
and write code, which contains rules and statements needed to transform the provided input into the
expected output. As we can imagine, these rules were not easy to write, especially for data that even
computers had a hard time understanding, e.g., images, videos, etc.
Data Science shifts this process a little bit. In it, we need access to large volumes of data that contain
the necessary inputs and their mappings to the expected outputs. Then, we use data science
algorithms, which use mathematical analysis to generate rules to map the given inputs to outputs.
This process of rule generation is called training. After training, we use some data that was set aside
before the training phase to test and check the system’s accuracy. The generated rules are a kind of
black box, and we cannot understand how the inputs are being transformed into outputs.
However, if the accuracy is good enough, then we can use the system (also called a model).
As described above, in traditional programming, we had to write the rules to map the input to the
output, but in Data Science, the rules are automatically generated or learned from the given data. This
helped solve some really difficult challenges that were being faced by several companies.
Supervised learning works on data that contains both the inputs and the expected output, i.e., labeled data, and is used to create models that can be employed to predict or classify things. Unsupervised learning works on data that contains no mappings from input to output, i.e., unlabeled data, and is used to extract meaningful information out of large volumes of data.
13. What is the difference between long format data and wide format
data?
In long format data, there is a column for possible variable types and a column for the values of those variables, whereas wide format data has a separate column for each variable.
Each row in the long format represents one time point per subject, so each subject will have many rows of data. In the wide format, the repeated responses of a subject are in a single row, with each response in its own column.
The long format is most typically used in R analysis and for writing to log files at the end of each experiment, while the wide format is most widely used in data manipulation and in statistical programs for repeated-measures ANOVAs, and is seldom used in R analysis.
A long format contains values that do repeat in the first column, whereas a wide format contains values that do not repeat in the first column.
Use df.melt() to convert the wide form to the long form, and use df.pivot().reset_index() to convert the long form into the wide form, as shown in the sketch below.
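Here is a short pandas sketch of both conversions (the column names and values are made up for illustration):

import pandas as pd

# Hypothetical wide-format data: one row per subject, one column per measurement
wide = pd.DataFrame({"subject": ["A", "B"], "test1": [10, 12], "test2": [15, 14]})

# Wide -> long: one row per (subject, variable) pair
long = wide.melt(id_vars="subject", var_name="test", value_name="score")

# Long -> wide: pivot back so each variable gets its own column again
wide_again = long.pivot(index="subject", columns="test", values="score").reset_index()

print(long)
print(wide_again)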
14. Mention some techniques used for sampling. What is the main
advantage of sampling?
Sampling is defined as the process of selecting a sample from a group of people or from any particular
kind for research purposes. It is one of the most important factors which decides the accuracy of a
research/survey result.
Probability sampling: It involves random selection, which gives every element a chance of being selected. Probability sampling has various subtypes, as mentioned below:
Simple Random Sampling
Systematic Sampling
Stratified Sampling
Cluster Sampling
Non-Probability Sampling: Non-probability sampling follows non-random selection, which means the selection is done based on ease of access or other required criteria. This helps collect data easily.
The following are various types of sampling in it:
Convenience Sampling
Purposive Sampling
Quota Sampling
Referral/Snowball Sampling
15. What is bias in data science?
Bias is a type of error that occurs in a data science model because of using an algorithm that is not
strong enough to capture the underlying patterns or trends that exist in the data. In other words, this
error occurs when the data is too complicated for the algorithm to understand, so it ends up building a
model that makes simple assumptions. This leads to lower accuracy because of underfitting. Algorithms
that can lead to high bias are linear regression, logistic regression, etc.
Python libraries such as Matplotlib, Pandas, NumPy, Keras, and SciPy are extensively used for data cleaning and analysis. These libraries are used to load and clean the data and perform effective analysis. For
instance, you might decide to remove outliers that are beyond a certain standard deviation from the
mean of a numerical column.
mean = df['Price'].mean()
std = df['Price'].std()
# Keep only the rows within 3 standard deviations of the mean (the threshold of 3 is just an example)
df = df[(df['Price'] - mean).abs() <= 3 * std]
Hence, this is how the process of data cleaning is done using Python libraries in the field of data science.
It has better data management and supports distributed computing by splitting the operations
between multiple tasks and nodes, which eventually decreases the complexity and execution time of
large datasets.
TensorFlow: Supports parallel computing with impeccable library management backed by Google.
SciPy: Mainly used for solving differential equations, multidimensional programming, data
manipulation, and visualization through graphs and charts.
Pandas: Used to implement the ETL(Extracting, Transforming, and Loading the datasets) capabilities
in business applications.
Matplotlib: Being free and open-source, it can be used as a replacement for MATLAB, which results
in better performance and low memory consumption.
PyTorch: Best for projects that involve machine learning algorithms and deep neural networks.
Cost function: Also referred to as the objective function, the cost function holds substantial utility
within machine learning algorithms, especially in optimization scenarios. Its purpose is to quantify the
disparity between predicted values and actual values. Minimizing the cost function entails optimizing
the model’s parameters or coefficients, aiming to achieve an optimal solution.
Loss function: Loss functions play a central role in supervised learning tasks. They evaluate the discrepancy or error between predicted values and actual labels. The selection of a specific
loss function depends on the problem at hand, such as employing mean squared error (MSE) for
regression tasks or cross-entropy loss for classification tasks. The loss function guides the model’s
optimization process during training, ultimately bolstering accuracy and overall performance.
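For instance, here is a small numeric sketch (plain NumPy, with made-up values) of the two losses mentioned above:

import numpy as np

# Mean squared error, a typical regression loss
y_true = np.array([3.0, 5.0, 2.5])
y_pred = np.array([2.8, 5.4, 2.1])
mse = np.mean((y_true - y_pred) ** 2)

# Binary cross-entropy, a typical classification loss
labels = np.array([1, 0, 1])
probs = np.array([0.9, 0.2, 0.7])
cross_entropy = -np.mean(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))

print(mse, cross_entropy)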
For example, imagine that we have a movie streaming platform, similar to Netflix or Amazon Prime. If a
user has previously watched and liked movies from action and horror genres, then it means that the
user likes watching movies of these genres. In that case, it would be better to recommend such movies
to this particular user. These recommendations can also be generated based on what users with
similar tastes like watching.
Data may also be distributed around a central value, i.e., mean, median, etc. This kind of distribution
has no bias either to the left or to the right and is in the form of a bell-shaped curve. This distribution
also has its mean equal to the median. This kind of distribution is called a normal distribution.
Deep Learning is an advanced version of neural networks to make the machines learn from data. In
Deep Learning, the neural networks comprise many hidden layers (which is why it is called ‘deep’
learning) that are connected to each other, and the output of the previous layer is the input of the
current layer.
29. Between Python and R, which one will you choose for analyzing the
text, and why?
Due to the following factors, Python will outperform R for text analytics:
Python’s Pandas module provides high-performance data analysis capabilities as well as simple-to-
use data structures.
Python does all sorts of text analytics more quickly.
31. What do you understand by a Recommender System? State its applications.
Recommender Systems are a subclass of information filtering systems designed to forecast the
preferences or ratings given to a product by a user.
The Amazon product suggestions page is an example of a recommender system in use. This section suggests products based on the user’s search history and previous orders.
33. What are the various skills required to become a Data Scientist?
The following abilities are necessary to become a certified Data Scientist:
Having familiarity with built-in data types such as lists, tuples, sets, and related types.
N-dimensional NumPy array knowledge is required.
Being able to use Pandas and Dataframes.
A strong hold on vectorized operations for performance.
Hands-on experience with Tableau and PowerBI.
Caffe
Keras
TensorFlow
PyTorch
Chainer
Microsoft Cognitive Toolkit
These examples represent only a fraction of the available variations and architectures tailored to
specific data types and problem domains.
Intermediate Data Science Interview Questions
Here, each internal node denotes a test on an attribute, each edge denotes an outcome of that test, and each leaf node holds a class label. So, in this case, we have a series of test conditions that give the final decision according to the conditions.
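A brief sketch (using scikit-learn and its built-in iris dataset, not code from the article) that prints such a tree of attribute tests and leaf class labels:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Each printed line is a test on an attribute; the leaves hold the class labels
print(export_text(tree, feature_names=load_iris().feature_names))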
41. Two candidates, Aman and Mohan appear for a Data Science Job
interview. The probability of Aman cracking the interview is 1/8 and that
of Mohan is 5/12. What is the probability that at least one of them will
crack the interview?
The probability of Aman getting selected for the interview is 1/8
P(A) = 1/8
P(B)=5/12
Now, the probability of at least one of them getting selected can be denoted as the union of A and B:
P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
where P(A ∩ B) stands for the probability of both Aman and Mohan getting selected for the job.
To calculate the final answer, we first have to find out the value of P(A ∩ B). Assuming the two selections are independent events,
P(A ∩ B) = P(A) × P(B) = 1/8 × 5/12 = 5/96
Therefore, P(A ∪ B) = 1/8 + 5/12 - 5/96 = 12/96 + 40/96 - 5/96 = 47/96 ≈ 0.49.
Database Design: This is the process of designing the database. The database design creates an output
which is a detailed data model of the database. Strictly speaking, database design includes the detailed
logical model of a database but it can also include physical design choices and storage parameters.
Here, it gives the minimum and maximum values from a specific column of the dataset. Also, it provides
the median, mean, 1st quartile, and 3rd quartile values that help us understand the values better.
50. How are Data Science and Machine Learning related to each other?
Data Science and Machine Learning are two terms that are closely related but are often misunderstood.
Both of them deal with data. However, there are some fundamental distinctions that show us how they
are different from each other.
Data Science is a broad field that deals with large volumes of data and allows us to draw insights from
this voluminous data. The entire process of data science takes care of multiple steps that are involved
in drawing insights out of the available data. This process includes crucial steps such as data gathering,
data analysis, data manipulation, data visualization, etc.
Machine Learning, on the other hand, can be thought of as a sub-field of data science. It also deals with
data, but here, we are solely focused on learning how to convert the processed data into a functional
model, which can be used to map inputs to outputs, e.g., a model that can expect an image as an input
and tell us if that image contains a flower as an output.
In short, data science deals with gathering data, processing it, and finally, drawing insights from it. The
field of data science that deals with building models using algorithms is called machine learning.
Therefore, machine learning is an integral part of data science.
Univariate analysis: Univariate analysis involves analyzing data with only one variable or, in other
words, a single column or a vector of the data. This analysis allows us to understand the data and
extract patterns and trends from it. Example: Analyzing the weight of a group of people.
Bivariate analysis: Bivariate analysis involves analyzing the data with exactly two variables or, in
other words, the data can be put into a two-column table. This kind of analysis allows us to figure
out the relationship between the variables. Example: Analyzing the data that contains temperature
and altitude.
Multivariate analysis: Multivariate analysis involves analyzing the data with more than two variables.
The number of columns of the data can be anything more than two. This kind of analysis allows us
to figure out the effects of all other variables (input variables) on a single variable (the output
variable).
Example: Analyzing data about house prices, which contains information about the houses, such as
locality, crime rate, area, the number of floors, etc.
For example, if in a column the majority of the data is missing, then dropping the column is the best
option, unless we have some means to make educated guesses about the missing values. However, if
the amount of missing data is low, then we have several strategies to fill them up.
One way would be to fill them all up with a default value or a value that has the highest frequency in that
column, such as 0 or 1, etc. This may be useful if the majority of the data in that column contains these
values.
Another way is to fill up the missing values in the column with the mean of all the values in that column.
This technique is usually preferred as the missing values have a higher chance of being closer to the
mean than to the mode.
Finally, if we have a huge dataset and only a few rows have values missing in some columns, then the easiest and fastest way is to drop those rows. Since the dataset is large, dropping a few rows should not be a problem anyway.
The reason why data with high dimensions is considered so difficult to deal with is that it leads to high
time consumption while processing the data and training a model on it. Reducing dimensions speeds
up this process, removes noise, and also leads to better model accuracy.
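As a minimal sketch of dimensionality reduction (scikit-learn's PCA on its built-in digits dataset), projecting 64 features down to 2 components:

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)       # 64 features per sample
X_reduced = PCA(n_components=2).fit_transform(X)
print(X.shape, "->", X_reduced.shape)     # (1797, 64) -> (1797, 2)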
Bias is an error that occurs when a model is too simple to capture the patterns in a dataset. To reduce
bias, we need to make our model more complex. Although making the model more complex can lead to
reducing bias, if we make the model too complex, it may end up becoming too rigid, leading to high
variance. So, the tradeoff between bias and variance is that if we increase the complexity, the bias
reduces and the variance increases, and if we reduce complexity, the bias increases and the variance
reduces. Our goal is to find a point at which our model is complex enough to give low bias but not so
complex to end up having high variance.
First, we calculate the errors in the predictions made by the regression model. For this, we calculate the
differences between the actual and the predicted values. Then, we square the errors.
After this step, we calculate the mean of the squared errors, and finally, we take the square root of the
mean of these squared errors. This number is the RMSE and a model with a lower value of RMSE is
considered to produce lower errors, i.e., the model will be more accurate.
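A small worked sketch of that calculation with NumPy (the actual and predicted values are made up):

import numpy as np

actual = np.array([3.0, 5.0, 2.5, 7.0])
predicted = np.array([2.5, 5.0, 3.0, 8.0])

errors = actual - predicted            # differences between actual and predicted values
rmse = np.sqrt(np.mean(errors ** 2))   # square, take the mean, then the square root
print(rmse)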
The inertia is calculated as the sum of the squared distances of all the points in a cluster from that cluster’s centroid. As k increases from a low value, we initially see a sharp decrease in the inertia value. After a certain value of k in the range, the drop in the inertia value becomes quite small; this ‘elbow’ point is the value of k that we need to choose for the k-means clustering algorithm.
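A compact sketch (scikit-learn on synthetic blob data) of inspecting the inertia over a range of k values to spot this elbow:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
for k in range(1, 9):
    inertia = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(inertia, 1))   # the drop in inertia levels off after the true number of clusters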
In case the outliers are not that extreme, then we can try:
A different kind of model. For example, if we were using a linear model, then we can choose a non-
linear model
Normalizing the data, which will shift the extreme values closer to other data points
Using algorithms that are not so affected by outliers, such as random forest, etc.
To calculate the accuracy, we need to divide the sum of the correctly classified observations by the
number of total observations.
However, sometimes some datasets are very complex, and it is difficult for one model to be able to
grasp the underlying trends in these datasets. In such situations, we combine several individual models
together to improve performance. This is what is called ensemble learning.
In other words, the content of the movie does not matter much. When recommending it to a user, what matters is whether other users similar to that particular user liked the movie or not.
For example, if a user is watching movies belonging to the action and mystery genre and giving them
good ratings, it is a clear indication that the user likes movies of this kind. If shown movies of a similar
genre as recommendations, there is a higher probability that the user would like those
recommendations as well.
In other words, here, the content of the movie is taken into consideration when generating
recommendations for users.
Once all the models are trained and it is time to make a prediction, we make predictions using all the trained models. For regression, we average the results, and for classification, we choose the result produced by the largest number of models, i.e., the majority vote.
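A quick sketch of this averaging behaviour, using scikit-learn's BaggingRegressor (whose default base learner is a decision tree) on synthetic data:

from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)

# 50 trees are trained on bootstrap samples; their predictions are averaged
bagger = BaggingRegressor(n_estimators=50, random_state=0).fit(X, y)
print(bagger.predict(X[:3]))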
In doing so, we take the patterns learned by a previous model and test them on a dataset when training
the new model. In each iteration, we give more importance to observations in the dataset that are
incorrectly handled or predicted by previous models. Boosting is useful in reducing bias in models as
well.
However, in stacking, we can combine weak models that use different learning algorithms as well. These
learners are called heterogeneous learners. Stacking works by training multiple (and different) weak
models or learners and then using them together by training another model, called a meta-model, to
make predictions based on the multiple outputs of predictions returned by these multiple weak
models.
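A short sketch (scikit-learn's StackingClassifier) with heterogeneous base learners and a logistic-regression meta-model:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier()), ("knn", KNeighborsClassifier())],
    final_estimator=LogisticRegression(max_iter=1000),
)
print(stack.fit(X, y).score(X, y))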
Deep Learning, on the other hand, is a field in machine learning that builds models using algorithms inspired by how the human brain learns from information in order to attain new capabilities. In deep learning, we make heavy use of deeply connected neural networks with many layers.
It has ‘naive’ in it because it makes the assumption that each variable in the dataset is independent of
the other. This kind of assumption is unrealistic for real-world data. However, even with this
assumption, it is very useful for solving a range of complicated problems, e.g., spam email classification,
etc.
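A tiny sketch of a Naive Bayes classifier in practice (scikit-learn's GaussianNB on the built-in iris dataset):

from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
model = GaussianNB().fit(X, y)   # assumes the features are independent of one another
print(model.predict(X[:5]))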
A probability sampling strategy called systematic sampling involves picking people from the population
at regular intervals, such as every 15th person on a population list. The population can be organized
randomly to mimic the benefits of simple random sampling.
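A minimal Python sketch of systematic sampling, picking every 15th element from a stand-in population list starting at a random offset:

import random

population = list(range(1, 1001))   # stand-in for a population list
step = 15
start = random.randint(0, step - 1)
sample = population[start::step]    # every 15th person from the starting point
print(len(sample), sample[:5])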
Batch gradient descent calculates the gradient using the entire set of data, so the volume of data used for each update is substantial. Stochastic gradient descent calculates the gradient using only a single sample at a time, so the volume of data used for each update is much lower.
Increase the amount of data in the dataset under study to make it simpler to separate the links
between the input and output variables.
To discover important traits or parameters that need to be examined, use feature selection.
Use regularization strategies to lessen the variation of the outcomes a data model generates.
In some cases, datasets are stabilized by adding a small amount of noisy data. This practice is called data augmentation.
In order to prevent overfitting and to estimate how the model will generalize to different data sets, cross-validation sets aside part of the data to test the model during the training phase (i.e., a validation data set).
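For example, 5-fold cross-validation can be sketched with scikit-learn as follows: each fold in turn serves as the validation set while the model trains on the remaining folds.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores, scores.mean())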
For example, suppose we are given a box with 10 blue marbles. Then, the entropy of the box is 0, as it contains marbles of only one color, i.e., there is no impurity, and if we draw a marble from the box, the probability of it being blue is 1.0. However, if we replace 4 of the blue marbles with 4 red marbles, the probability of drawing a blue marble drops to 0.6 and the entropy of the box rises to about 0.97 bits, reflecting the increased impurity.
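A worked check of the marble example (an illustrative calculation, not code from the article):

import math

def entropy(probabilities):
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(entropy([1.0]))        # all 10 marbles blue -> 0.0 (pure box)
print(entropy([0.6, 0.4]))   # 6 blue and 4 red -> about 0.97 bits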
Additionally, in a decision tree algorithm, multi-class entropy is a measure used to evaluate the impurity
or disorder of a dataset with respect to the class labels when there are multiple classes involved. It is
commonly used as a criterion to make decisions about splitting nodes in a decision tree.
Let’s consider a practical example to gain a better understanding of how information gain operates
within a decision tree algorithm. Imagine we have a dataset containing customer information such as
age, income, and purchase history. Our objective is to predict whether a customer will make a purchase
or not.
To determine which attribute provides the most valuable information, we calculate the information gain
for each attribute. If splitting the data based on income leads to subsets with significantly reduced
entropy, it indicates that income plays a crucial role in predicting purchase behavior. Consequently,
income becomes a crucial factor in constructing the decision tree as it offers valuable insights.
By maximizing information gain, the decision tree algorithm identifies attributes that effectively reduce
uncertainty and enable accurate splits. This process enhances the model’s predictive accuracy,
enabling informed decisions pertaining to customer purchases.
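A small illustrative sketch (with hypothetical purchase labels) of that computation: information gain is the entropy of the parent set minus the weighted average entropy of the child subsets produced by the split.

import math

def entropy(labels):
    total = len(labels)
    probs = [labels.count(c) / total for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

# Hypothetical 'will purchase' labels before and after splitting on income
parent = ["yes", "yes", "no", "no", "yes", "no", "yes", "no"]
high_income = ["yes", "yes", "yes", "no"]
low_income = ["no", "no", "yes", "no"]

weighted_children = (len(high_income) / len(parent)) * entropy(high_income) \
                  + (len(low_income) / len(parent)) * entropy(low_income)
info_gain = entropy(parent) - weighted_children
print(info_gain)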
80. From the below given ‘diamonds’ dataset, extract only those rows
where the ‘price’ value is greater than 1000 and the ‘cut’ is ideal.
library(ggplot2)
diamonds[diamonds$price > 1000 & diamonds$cut == "Ideal", ]
81. Make a scatter plot between ‘price’ and ‘carat’ using ggplot. ‘Price’
should be on the y-axis, ’carat’ should be on the x-axis, and the ‘color’ of
the points should be determined by ‘cut.’
We will implement the scatter plot using ggplot.
The ggplot is based on the grammar of data visualization, and it helps us stack multiple layers on top of
each other.
So, we will start with the data layer, and on top of the data layer we will stack the aesthetic layer. Finally,
on top of the aesthetic layer we will stack the geometry layer.
Code:
ggplot(data = diamonds, aes(x = carat, y = price, color = cut)) + geom_point()
82. Introduce 25 percent missing values in this ‘iris’ dataset and impute
the ‘Sepal.Length’ column with ‘mean’ and the ‘Petal.Length’ column with
‘median.’
To introduce missing values, we will be using the missForest package:
library(missForest)
iris.mis <- prodNA(iris, noNA = 0.25)
For imputing the ‘Sepal.Length’ column with ‘mean’ and the ‘Petal.Length’ column with ‘median,’ we will
be using the Hmisc package and the impute function:
library(Hmisc)
iris.mis$Sepal.Length<-with(iris.mis, impute(Sepal.Length,mean))
iris.mis$Petal.Length<-with(iris.mis, impute(Petal.Length,median))
Here, we need to find how ‘mpg’ varies with respect to the displacement (‘disp’) column.
We need to divide this data into the training dataset and the testing dataset so that the model does not
overfit the data.
What happens is that when we do not divide the dataset into these two components, the model overfits the dataset. Hence, when we add new data, it fails miserably on that new data.
Therefore, to divide this dataset, we will use the caret package. This caret package provides the createDataPartition() function, which returns the row indices of the records selected for the training partition.
library(caret)
createDataPartition(mtcars$mpg, p = 0.65, list = FALSE) -> split_tag
mtcars[split_tag,] -> train
mtcars[-split_tag,] -> test
lm(mpg ~ disp, data = train) -> mod_mtcars
predict(mod_mtcars, newdata = test) -> pred_mtcars
head(pred_mtcars)
Explanation:
Parameters of the createDataPartition() function: the first is the column that determines the split (here, the mpg column).
The second is the split ratio, which is 0.65, i.e., 65 percent of the records will go into the training partition and the remaining 35 percent will be left out. The function returns the row indices of the selected records, which we store in a split_tag object.
Once we have the split_tag object ready, we select from the entire mtcars dataframe all those records whose row indices appear in split_tag and store them in the training set.
Since split_tag holds the training row indices, putting a ‘-’ symbol in front of it (‘-split_tag’) excludes those rows, so mtcars[-split_tag,] selects exactly the records that were left out. We select all those records and store them in the test set.
We will go ahead and build a model on top of the training set; for the simple linear model, we will use the lm function.
lm(mpg ~ disp, data = train) -> mod_mtcars
Now, we have built the model on top of the train set. It’s time to predict the values on top of the test set.
For that, we will use the predict function that takes in two parameters: first is the model which we have
built and the second is the dataframe on which we have to predict values.
Thus, we have to predict values for the test set and then store them in pred_mtcars.
predict(mod_mtcars,newdata=test)->pred_mtcars
Output:
These are the predicted values of mpg for all of these cars.
So, this is how we can build a simple linear model on top of this mtcars dataset.
Code:
cbind(Actual = test$mpg, Predicted = pred_mtcars) -> final_data
as.data.frame(final_data) -> final_data
error <- final_data$Actual - final_data$Predicted
cbind(final_data, error) -> final_data
sqrt(mean(final_data$error^2))
Explanation: We have the actual and the predicted values. We will bind both of them into a single
dataframe. For that, we will use the cbind function:
cbind(Actual = test$mpg, Predicted = pred_mtcars) -> final_data
Our actual values are present in the mpg column from the test set, and our predicted values are stored
in the pred_mtcars object which we have created in the previous question. Hence, we will create this