
Top 100+ Data Science Interview Questions in 2024
By Shlok Pandey 7.4K Views 69 min read Updated on December 28, 2023
Data Science is one of the hottest jobs today. According to The Economic Times, job postings for Data Science profiles have grown over 400 times in the past year. So, if you want to start your career as a Data Scientist, here are some top Data Science interview questions and answers that will help you crack your interview.

Top Data Science Interview Questions And Answers


Data Science is among the leading and most popular technologies in the world today. Major organizations are hiring professionals in this field with high salaries due to the increasing demand and low availability of these professionals. Data scientists are among the highest-paid IT professionals. This data science interview preparation blog includes the most frequently asked questions in data science job interviews. Here is a list of these popular data science interview questions:

Q1. What is Data Science?
Q2. Differentiate between Data Analytics and Data Science
Q3. How is Python Useful?
Q4. How R is Useful in the Data Science Domain?
Q5. What is Supervised Learning?
Q6. What is Unsupervised Learning?
Q7. What do you understand about Linear Regression?
Q8. What do you understand by logistic regression?
Q9. What is a confusion matrix?
Q10. What do you understand about the true-positive rate and false-positive rate?

Following are the three categories into which these Data Science interview questions are divided:

1. Basic Level
2. Intermediate Level
3. Advanced Level

Check out this video on the Data Science course:



Basic Data Science Interview Questions

1. What is Data Science?


Data Science is a field of computer science that explicitly deals with turning data into information and
extracting meaningful insights from it. The reason why Data Science is so popular is that the kind of
insights it allows us to draw from the available data has led to some major innovations in several
products and companies. Using these insights, we are able to determine the taste of a particular
customer, the likelihood of a product succeeding in a particular market, etc.
Check out this comprehensive Data Scientist Certification!

2. Differentiate between Data Analytics and Data Science

Data Analytics vs. Data Science:

Scope: Data Analytics is a subset of Data Science, whereas Data Science is a broad field that includes various subsets such as Data Analytics, Data Mining, and Data Visualization.

Goal: The goal of data analytics is to illustrate the precise details of retrieved insights, whereas the goal of data science is to discover meaningful insights from massive datasets and derive the best possible solutions to resolve business issues.

Skills: Data analytics requires just basic programming languages, whereas data science requires knowledge of advanced programming languages, statistics, and specialized machine learning algorithms.

Focus: Data analytics focuses on just finding the solutions, whereas data science not only focuses on finding solutions but also predicts the future using past patterns and insights.

Role: A data analyst's job is to analyze data in order to make decisions, whereas a data scientist's job is to provide insightful data visualizations from raw data that are easily understandable.

Become an expert Data Scientist. Enroll now in the PG program in Data Science and Machine Learning from MITxMicroMasters.

3. How is Python Useful?


Python is widely recognized as an exceptionally advantageous programming language due to its
versatility and simplicity. Its extensive range of applications and associated benefits have established it
as a preferred choice among developers. Notably, Python stands out in terms of readability and user-
friendliness.

Its syntax is meticulously designed to be intuitive and concise, enabling ease in coding, comprehension,
and maintenance. Additionally, Python offers a comprehensive standard library that encompasses a
diverse collection of pre-built modules and functions. This wealth of resources substantially minimizes
the time and effort expended by developers, streamlining the execution of routine programming tasks.

4. How R is Useful in the Data Science Domain?


Here are some ways in which R is useful in the data science domain:

Data Manipulation and Analysis: R offers a comprehensive collection of libraries and functions that
facilitate proficient data manipulation, transformation, and statistical analysis.
Statistical Modeling and Machine Learning: R offers a wide range of packages for advanced
statistical modeling and machine learning tasks, empowering data scientists to build predictive
models and perform complex analyses.
Data Visualization: R’s extensive visualization libraries enable the creation of visually appealing and
insightful plots, charts, and graphs.
Reproducible Research: R supports the integration of code, data, and documentation, facilitating
reproducible workflows and ensuring transparency in data science projects.

5. What is Supervised Learning?


Supervised learning is a machine learning approach in which an algorithm learns from labeled training
data to make predictions or classify new, unseen data. It involves the use of input data and
corresponding output labels, allowing the algorithm to learn patterns and relationships. The goal is to
generalize the learned patterns and accurately predict outputs for new input data based on the learned
patterns.

6. What is Unsupervised Learning?


Unsupervised learning is a machine learning approach wherein an algorithm uncovers patterns and
structures within unlabeled data, operating without explicit guidance or predetermined output labels.
Its objective is to reveal hidden relationships, patterns, and clusters present in the data. Unlike
supervised learning, the algorithm autonomously explores the data to identify inherent structures and
draw inferences, proving valuable for exploratory data analysis and the discovery of novel insights.


7. What do you understand about Linear Regression?


Linear regression helps in understanding the linear relationship between the dependent and the
independent variables. Linear regression is a supervised learning algorithm, which helps in finding the
linear relationship between two variables. One is the predictor or the independent variable and the
other is the response or the dependent variable. In linear regression, we try to understand how the
dependent variable changes with respect to the independent variable. If there is only one independent
variable, then it is called simple linear regression, and if there is more than one independent variable
then it is known as multiple linear regression.
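To make this concrete, here is a minimal, illustrative sketch of simple linear regression, assuming scikit-learn and NumPy are available; the numbers and variable names are hypothetical:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: one independent variable (years of experience) and one dependent variable (salary)
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([30000, 35000, 41000, 44000, 50000])

model = LinearRegression()
model.fit(X, y)                       # learns the line y = intercept + slope * x
print(model.intercept_, model.coef_)  # fitted intercept and slope
print(model.predict([[6]]))           # prediction for a new value of the independent variable

Adding a second or third column to X would turn this into multiple linear regression.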

8. What do you understand by logistic regression?


Logistic regression is a classification algorithm that can be used when the dependent variable is binary.
Let’s take an example. Here, we are trying to determine whether it will rain or not on the basis of
temperature and humidity.

Temperature and humidity are the independent variables, and rain would be our dependent variable.
So, the logistic regression algorithm actually produces an S shape curve.

So, basically in logistic regression, the Y value lies within the range of 0 and 1. This is how logistic
regression works.
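As a rough sketch of the rain example (entirely hypothetical data, assuming scikit-learn is available), logistic regression outputs a probability between 0 and 1 through the S-shaped sigmoid function:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: [temperature, humidity] -> rain (1) or no rain (0)
X = np.array([[30, 40], [25, 80], [20, 90], [35, 30], [22, 85], [33, 45]])
y = np.array([0, 1, 1, 0, 1, 0])

clf = LogisticRegression()
clf.fit(X, y)
print(clf.predict_proba([[24, 75]]))  # probability of each class, between 0 and 1
print(clf.predict([[24, 75]]))        # predicted class label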

9. What is a confusion matrix?


The confusion matrix is a table that is used to evaluate the performance of a classification model. It tabulates the actual values and the predicted values in a 2×2 matrix.

True Positive (d): Records where the actual values are true and the predicted values are also true.
False Negative (c): Records where the actual values are true, but the predicted values are false.
False Positive (b): Records where the actual values are false, but the predicted values are true.
True Negative (a): Records where the actual values are false and the predicted values are also false.

The correctly classified records are represented by the true positives and the true negatives. This is how the confusion matrix works.
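A minimal sketch of building a confusion matrix with scikit-learn (the labels below are hypothetical):

from sklearn.metrics import confusion_matrix

# Hypothetical actual and predicted labels of a binary classifier
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For labels [0, 1], scikit-learn returns the matrix [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)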

10. What do you understand about the true-positive rate and false-
positive rate?
True positive rate: In Machine Learning, true-positive rates, which are also referred to as sensitivity or
recall, are used to measure the percentage of actual positives which are correctly identified. Formula:

True Positive Rate = True Positives/Positives

False positive rate: The false-positive rate is basically the probability of falsely rejecting the null hypothesis for a particular test. It is calculated as the ratio of the number of negative events wrongly categorized as positive (false positives) to the total number of actual negative events. Formula:

False-Positive Rate = False Positives/Negatives

Check out this comprehensive Data Science Course in India!

11. How is Data Science different from traditional application programming?
Data Science takes a fundamentally different approach to building systems that provide value than
traditional application development.

In traditional programming paradigms, we used to analyze the input, figure out the expected output,
and write code, which contains rules and statements needed to transform the provided input into the
expected output. As we can imagine, these rules were not easy to write, especially, for data that even
computers had a hard time understanding, e.g., images, videos, etc.

Data Science shifts this process a little bit. In it, we need access to large volumes of data that contain
the necessary inputs and their mappings to the expected outputs. Then, we use data science
algorithms, which use mathematical analysis to generate rules to map the given inputs to outputs.

This process of rule generation is called training. After training, we use some data that was set aside
before the training phase to test and check the system’s accuracy. The generated rules are a kind of
black box, and we cannot understand how the inputs are being transformed into outputs.

However, If the accuracy is good enough, then we can use the system (also called a model).

As described above, in traditional programming, we had to write the rules to map the input to the
output, but in Data Science, the rules are automatically generated or learned from the given data. This
helped solve some really difficult challenges that were being faced by several companies.

Interested to learn Data Science skills? Check our Data Science course in Kottayam Now!

12. Explain the differences between supervised and unsupervised learning.
Supervised and unsupervised learning are two types of Machine Learning techniques. They both allow
us to build models. However, they are used for solving different kinds of problems.

Supervised Learning vs. Unsupervised Learning:

Data: Supervised learning works on data that contains both the inputs and the expected output, i.e., labeled data, whereas unsupervised learning works on data that contains no mappings from input to output, i.e., unlabeled data.

Use: Supervised learning is used to create models that can be employed to predict or classify things, whereas unsupervised learning is used to extract meaningful information out of large volumes of data.

Algorithms: Commonly used supervised learning algorithms include linear regression and decision trees; commonly used unsupervised learning algorithms include k-means clustering and the Apriori algorithm.

13. What is the difference between long format data and wide format
data?

Long Format Data vs. Wide Format Data:

Structure: Long format data has a column for possible variable types and a column for the values of those variables, whereas wide format data has a separate column for each variable.

Rows: In the long format, each row represents one time point per subject, so each subject will have many rows of data. In the wide format, the repeated responses of a subject appear in a single row, with each response in its own column.

Usage: The long format is most typically used in R analyses and for writing to log files at the end of each experiment, whereas the wide format is most widely used in data manipulation and in stats programs for repeated-measures ANOVAs, and is seldom used in R analyses.

Repetition: A long format contains values that do repeat in the first column, whereas a wide format contains values that do not repeat in the first column.

Conversion: Use df.melt() to convert the wide form to the long form and df.pivot().reset_index() to convert the long form back to the wide form, as sketched below.

14. Mention some techniques used for sampling. What is the main
advantage of sampling?
Sampling is defined as the process of selecting a sample from a group of people or from any particular
kind for research purposes. It is one of the most important factors which decides the accuracy of a
research/survey result.

Mainly, there are two types of sampling techniques:

Probability sampling: It involves random selection which makes every element get a chance to be
selected. Probability sampling has various subtypes in it, as mentioned below:

Simple Random Sampling


Stratified sampling
Systematic sampling
Cluster Sampling
Multi-stage Sampling

Non- Probability Sampling: Non-probability sampling follows non-random selection which means the
selection is done based on your ease or any other required criteria. This helps to collect the data easily.
The following are various types of sampling in it:

Convenience Sampling
Purposive Sampling
Quota Sampling
Referral /Snowball Sampling
15. What is bias in data science?
Bias is a type of error that occurs in a data science model because of using an algorithm that is not
strong enough to capture the underlying patterns or trends that exist in the data. In other words, this
error occurs when the data is too complicated for the algorithm to understand, so it ends up building a
model that makes simple assumptions. This leads to lower accuracy because of underfitting. Algorithms
that can lead to high bias are linear regression, logistic regression, etc.

16. What is dimensionality reduction?


Dimensionality reduction is the process of converting a dataset with a high number of dimensions
(fields) to a dataset with a lower number of dimensions. This is done by dropping some fields or
columns from the dataset. However, this is not done haphazardly. In this process, the dimensions or
fields are dropped only after making sure that the remaining information will still be enough to
succinctly describe similar information.

17. Why is Python used for data cleaning in DS?


Data Scientists have to clean and transform huge data sets into a form that they can work with. It is
important to deal with redundant data for better results by removing nonsensical outliers, malformed
records, missing values, inconsistent formatting, etc.

Python libraries such as Matplotlib, Pandas, NumPy, Keras, and SciPy are extensively used for data cleaning and analysis. These libraries are used to load and clean the data and do effective analysis. For
instance, you might decide to remove outliers that are beyond a certain standard deviation from the
mean of a numerical column.

mean = df['Price'].mean()

std = df['Price'].std()

threshold = mean + (3 * std) # Set a threshold for outliers

df = df[df['Price'] < threshold] # Remove outliers

Hence, this is how the process of data cleaning is done using python libraries in the field of data
science.

Learn more about Data Cleaning in a Data Science Tutorial!

18. Why is R used in Data Visualization?


R provides the best ecosystem for data analysis and visualization with more than 12,000 packages in
Open-source repositories. It has huge community support, which means you can easily find the
solution to your problems on various platforms like StackOverflow.

It has better data management and supports distributed computing by splitting the operations
between multiple tasks and nodes, which eventually decreases the complexity and execution time of
large datasets.

19. What are the popular libraries used in Data Science?


Below are the popular libraries used for data extraction, cleaning, visualization, and deploying DS
models:

TensorFlow: Supports parallel computing with impeccable library management backed by Google.
SciPy: Mainly used for solving differential equations, multidimensional programming, data
manipulation, and visualization through graphs and charts.
Pandas: Used to implement ETL (extracting, transforming, and loading the datasets) capabilities in business applications.
Matplotlib: Being free and open-source, it can be used as a replacement for MATLAB, which results
in better performance and low memory consumption.
PyTorch: Best for projects that involve machine learning algorithms and deep neural networks.

Interested to learn more about Data Science, check out our Data Science Course in New York!


20. What are important functions used in Data Science?
Within the realm of data science, various pivotal functions assume critical roles across diverse tasks.
Among these, two foundational functions are the cost function and the loss function.

Cost function: Also referred to as the objective function, the cost function holds substantial utility
within machine learning algorithms, especially in optimization scenarios. Its purpose is to quantify the
disparity between predicted values and actual values. Minimizing the cost function entails optimizing
the model’s parameters or coefficients, aiming to achieve an optimal solution.

Loss function: Loss functions are of particular importance in supervised learning tasks. They evaluate the discrepancy or error between predicted values and actual labels. The choice of a specific loss function depends on the problem at hand, such as mean squared error (MSE) for regression tasks or cross-entropy loss for classification tasks. The loss function guides the model’s optimization process during training, ultimately improving accuracy and overall performance.

21. What is k-fold cross-validation?


In k-fold cross-validation, we divide the dataset into k equal parts. After this, we loop over the entire
dataset k times. In each iteration of the loop, one of the k parts is used for testing, and the other k − 1
parts are used for training. Using k-fold cross-validation, each one of the k parts of the dataset ends up
being used for training and testing purposes.
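A minimal scikit-learn sketch of 5-fold cross-validation (illustrative only, using the built-in iris dataset):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# cv=5 splits the data into 5 equal parts; each part is used exactly once for testing
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores, scores.mean())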

22. Explain how a recommender system works.


A recommender system is a system that many consumer-facing, content-driven, online platforms
employ to generate recommendations for users from a library of available content. These systems
generate recommendations based on what they know about the users’ tastes from their activities on
the platform.

For example, imagine that we have a movie streaming platform, similar to Netflix or Amazon Prime. If a
user has previously watched and liked movies from action and horror genres, then it means that the
user likes watching movies of these genres. In that case, it would be better to recommend such movies
to this particular user. These recommendations can also be generated based on what users with
similar tastes like watching.

23. What is Poisson Distribution?


The Poisson distribution is a statistical probability distribution used to represent the occurrence of
events within a specific interval of time or space. It is commonly employed to characterize infrequent
events that happen independently and at a consistent average rate, such as quantifying the number of
incoming phone calls received within a given hour.


24. What is a normal distribution?


Data distribution is a visualization tool to analyze how data is spread out or distributed. Data can be
distributed in various ways. For instance, it could be with a bias to the left or the right, or it could all be
jumbled up.

Data may also be distributed around a central value, i.e., mean, median, etc. This kind of distribution
has no bias either to the left or to the right and is in the form of a bell-shaped curve. This distribution
also has its mean equal to the median. This kind of distribution is called a normal distribution.

25. What is Deep Learning?


Deep Learning is a kind of Machine Learning, in which neural networks are used to imitate the structure
of the human brain, and just like how a brain learns from information, machines are also made to learn
from the information that is provided to them.

Deep Learning is an advanced version of neural networks to make the machines learn from data. In
Deep Learning, the neural networks comprise many hidden layers (which is why it is called ‘deep’
learning) that are connected to each other, and the output of the previous layer is the input of the
current layer.

26. What is CNN (Convolutional Neural Network)?


A Convolutional Neural Network (CNN) is an advanced deep learning architecture designed specifically
for analyzing visual data, such as images and videos. It is composed of interconnected layers of neurons
that utilize convolutional operations to extract meaningful features from the input data. CNNs exhibit
remarkable effectiveness in tasks like image classification, object detection, and image recognition,
thanks to their inherent ability to autonomously learn hierarchical representations and capture spatial
relationships within the data, eliminating the need for explicit feature engineering.

27. What is an RNN (recurrent neural network)?


A recurrent neural network, or RNN for short, is a kind of Machine Learning algorithm that makes use of
the artificial neural network. RNNs are used to find patterns from a sequence of data, such as time
series, stock market, temperature, etc. RNNs are a kind of feedforward network, in which information
from one layer passes to another layer, and each node in the network performs mathematical
operations on the data. These operations are temporal, i.e., RNNs store contextual information about
previous computations in the network. It is called recurrent because it performs the same operations
on some data every time it is passed. However, the output may be different based on past
computations and their results.

28. Explain selection bias.


Selection bias is the bias that occurs during the sampling of data. This kind of bias occurs when a
sample is not representative of the population, which is going to be analyzed in a statistical study.

29. Between Python and R, which one will you choose for analyzing the
text, and why?
Due to the following factors, Python will outperform R for text analytics:

Python’s Pandas module provides high-performance data analysis capabilities as well as simple-to-
use data structures.
Python does all sorts of text analytics more quickly.

30. Explain the purpose of data cleaning


Data cleaning’s primary goal is to rectify or eliminate inaccurate, corrupted, improperly formatted,
duplicate, or incomplete data from a dataset. This often yields better outcomes and a higher return on
investment for marketing and communications efforts.

31. What do you understand by a Recommender System? State its applications.
Recommender Systems are a subclass of information filtering systems designed to forecast the
preferences or ratings given to a product by a user.

The Amazon product suggestions page is an example of a recommender system in use. Based on the
user’s search history and previous orders, this area contains products.

32. What is Gradient Descent?


Gradient descent (GD) is an iterative first-order optimization algorithm used to find a local minimum of a given function (its counterpart, gradient ascent, finds a local maximum). This technique is frequently applied in machine learning (ML) and deep learning (DL) to minimize a cost/loss function (for example, in linear regression).

33. What are the various skills required to become Data Scientist?
The following abilities are necessary to become a certified Data Scientist:
Having familiarity with built-in data types like lists, tuples, sets, and dictionaries.
Knowledge of N-dimensional NumPy arrays.
Being able to use Pandas and DataFrames.
A strong command of vector operations.
Hands-on experience with Tableau and Power BI.

34. What is TensorFlow?


TensorFlow is a free and open-source software library for machine learning and artificial intelligence. It enables programmers to build dataflow graphs, which are representations of the flow of data among processing nodes in a graph.

35. What is Dropout?


In Data Science, the term “dropout” refers to the process of randomly removing visible and hidden
network units. By eliminating up to 20% of the nodes, they avoid overfitting the data and allow for the
necessary space to be set up for the network’s iterative convergence process.

36. State any five Deep Learning Frameworks.


Some of the Deep Learning frameworks are:

Caffe
Keras
TensorFlow
Pytorch
Chainer
Microsoft Cognitive Toolkit

37. Define Neural Networks and their types


Neural Networks are computational models that derive their principles from the structure and
functionality of the human brain. Consisting of interconnected artificial neurons organized in layers,
Neural Networks exhibit remarkable capacities in learning and discerning patterns within datasets.
Consequently, they assume a pivotal role in diverse domains including pattern recognition,
classification, and optimization, thereby providing invaluable solutions in the realm of artificial
intelligence.

There exist various types of Neural Networks, including:

Feedforward Neural Networks: These networks facilitate a unidirectional information flow,


progressing from input to output. They find frequent application in tasks involving pattern
recognition and classification.
Convolutional Neural Networks (CNNs): Specifically tailored for grid-like data, such as images or
videos, CNNs leverage convolutional layers to extract meaningful features. Their prowess lies in
tasks like image classification and object detection.
Recurrent Neural Networks (RNNs): RNNs are particularly adept at handling sequential data,
wherein the present output is influenced by past inputs. They are extensively utilized in domains
such as language modeling and time series analysis.
Long Short-Term Memory (LSTM) Networks: This variation of RNNs addresses the issue of
vanishing gradients and excels at capturing long-term dependencies in data. LSTM networks find
wide-ranging applications in areas like speech recognition and natural language processing.
Generative Adversarial Networks (GANs): GANs consist of a generator and a discriminator that are trained in a competitive manner. They are employed to generate new data samples and are helpful for tasks like image generation and text synthesis.

These examples represent only a fraction of the available variations and architectures tailored to
specific data types and problem domains.
Intermediate Data Science Interview Questions

38. What is the ROC curve?


ROC stands for Receiver Operating Characteristic. The ROC curve is a plot of the true positive rate against the false positive rate, and it helps us find the right tradeoff between the two rates for different probability thresholds of the predicted values. The closer the curve is to the upper left corner, the better the model; in other words, whichever curve has the greater area under it is the better model.

39. What do you understand by a decision tree?


A decision tree is a supervised learning algorithm that is used for both classification and regression. Hence, in this case, the dependent variable can be either a numerical value or a categorical value.

Here, each internal node denotes a test on an attribute, each edge denotes an outcome of that test, and each leaf node holds a class label. So, we have a series of test conditions that give the final decision according to the conditions.

Are you interested in learning Data Science from experts? Enroll in our Data Science Course in Bangalore now!

40. What do you understand by a random forest model?


It combines multiple models together to get the final output or, to be more precise, it combines
multiple decision trees together to get the final output. So, decision trees are the building blocks of the
random forest model.

41. Two candidates, Aman and Mohan appear for a Data Science Job
interview. The probability of Aman cracking the interview is 1/8 and that
of Mohan is 5/12. What is the probability that at least one of them will
crack the interview?
The probability of Aman getting selected for the interview is 1/8
P(A) = 1/8

The probability of Mohan getting selected for the interview is 5/12

P(B)=5/12

Now, the probability of at least one of them getting selected can be denoted as the union of A and B, which means

P(A ∪ B) = P(A) + P(B) − P(A ∩ B) ………………………(1)

where P(A ∩ B) stands for the probability of both Aman and Mohan getting selected for the job.

To calculate the final answer, we first have to find out the value of P(A ∩ B). Since the two events are independent,

P(A ∩ B) = P(A) * P(B) = 1/8 * 5/12 = 5/96

Now, put the value of P(A ∩ B) into equation (1):

P(A ∪ B) = P(A) + P(B) − P(A ∩ B) = 1/8 + 5/12 − 5/96 = 12/96 + 40/96 − 5/96 = 47/96

So, the answer is 47/96.

42. How is Data modeling different from Database design?


Data Modeling: It can be considered as the first step towards the design of a database. Data modeling
creates a conceptual model based on the relationship between various data models. The process
involves moving from the conceptual stage to the logical model to the physical schema. It involves the
systematic method of applying data modeling techniques.

Database Design: This is the process of designing the database. The database design creates an output
which is a detailed data model of the database. Strictly speaking, database design includes the detailed
logical model of a database but it can also include physical design choices and storage parameters.

43. What is precision?


Precision: When we implement algorithms for the classification of data or the retrieval of information, precision gives us the proportion of predicted positive values that are actually positive. Basically, it measures the accuracy of the positive predictions. The formula to calculate precision is:

Precision = True Positives / (True Positives + False Positives)

44. What is a recall?


Recall: It is the proportion of actual positive instances that are correctly predicted as positive. Recall helps us identify the positive instances that the model misses (false negatives). We use the below formula to calculate recall:

Recall = True Positives / (True Positives + False Negatives)

45. What is the F1 score and how to calculate it?


The F1 score is the harmonic mean of precision and recall and gives a single measure of the test’s accuracy. If F1 = 1, then precision and recall are both perfect. The closer F1 gets to 0, the less accurate precision or recall (or both) are. The formula to calculate the F1 score is:

F1 = 2 * (Precision * Recall) / (Precision + Recall)
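These three metrics can be computed with scikit-learn as follows (a small, illustrative sketch with hypothetical labels):

from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(precision_score(y_true, y_pred))  # TP / (TP + FP)
print(recall_score(y_true, y_pred))     # TP / (TP + FN)
print(f1_score(y_true, y_pred))         # harmonic mean of precision and recall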

46. What is a p-value?


The p-value is a measure of the statistical significance of an observation. It is the probability of obtaining results at least as extreme as those observed, assuming that the null hypothesis is true. We compute the p-value to know the test statistics of a model. Typically, it helps us decide whether we can accept or reject the null hypothesis.

47. Why do we use p-value?


We use the p-value to understand whether the given data really describes the observed effect or not. For an observed effect ‘E’ and a null hypothesis ‘H0’, the p-value is calculated as p-value = P(E | H0), i.e., the probability of observing an effect at least as extreme as E given that H0 is true.

48. What is the difference between an error and a residual error?


An error is the difference between an observed value and the true value of the underlying process, whereas a residual is the difference between an observed value and the value predicted by the model. The reason we use the residual error to evaluate the performance of an algorithm is that the true values are never known. Hence, we use the observed values to measure the error using residuals, which gives us a usable estimate of the error.

49. Why do we use the summary function?


The summary function in R gives us the statistics of the implemented algorithm on a particular dataset.
It consists of various objects, variables, data attributes, etc. It provides summary statistics for individual
objects when fed into the function. We use a summary function when we want information about the
values present in the dataset. It gives us the summary statistics in the following form:

Here, it gives the minimum and maximum values from a specific column of the dataset. Also, it provides
the median, mean, 1st quartile, and 3rd quartile values that help us understand the values better.

50. How are Data Science and Machine Learning related to each other?
Data Science and Machine Learning are two terms that are closely related but are often misunderstood.
Both of them deal with data. However, there are some fundamental distinctions that show us how they
are different from each other.

Data Science is a broad field that deals with large volumes of data and allows us to draw insights from
this voluminous data. The entire process of data science takes care of multiple steps that are involved
in drawing insights out of the available data. This process includes crucial steps such as data gathering,
data analysis, data manipulation, data visualization, etc.

Machine Learning, on the other hand, can be thought of as a sub-field of data science. It also deals with
data, but here, we are solely focused on learning how to convert the processed data into a functional
model, which can be used to map inputs to outputs, e.g., a model that can expect an image as an input
and tell us if that image contains a flower as an output.

In short, data science deals with gathering data, processing it, and finally, drawing insights from it. The
field of data science that deals with building models using algorithms is called machine learning.
Therefore, machine learning is an integral part of data science.

51. Explain univariate, bivariate, and multivariate analyses.


When we are dealing with data analysis, we often come across terms such as univariate, bivariate, and
multivariate. Let’s try and understand what these mean.

Univariate analysis: Univariate analysis involves analyzing data with only one variable or, in other
words, a single column or a vector of the data. This analysis allows us to understand the data and
extract patterns and trends from it. Example: Analyzing the weight of a group of people.
Bivariate analysis: Bivariate analysis involves analyzing the data with exactly two variables or, in
other words, the data can be put into a two-column table. This kind of analysis allows us to figure
out the relationship between the variables. Example: Analyzing the data that contains temperature
and altitude.
Multivariate analysis: Multivariate analysis involves analyzing the data with more than two variables.
The number of columns of the data can be anything more than two. This kind of analysis allows us
to figure out the effects of all other variables (input variables) on a single variable (the output
variable).

Example: Analyzing data about house prices, which contains information about the houses, such as
locality, crime rate, area, the number of floors, etc.

52. How can we handle missing data?


To be able to handle missing data, we first need to know the percentage of data missing in a particular
column so that we can choose an appropriate strategy to handle the situation.

For example, if in a column the majority of the data is missing, then dropping the column is the best
option, unless we have some means to make educated guesses about the missing values. However, if
the amount of missing data is low, then we have several strategies to fill them up.

One way would be to fill them all up with a default value or a value that has the highest frequency in that
column, such as 0 or 1, etc. This may be useful if the majority of the data in that column contains these
values.

Another way is to fill up the missing values in the column with the mean of all the values in that column.
This technique is usually preferred as the missing values have a higher chance of being closer to the
mean than to the mode.

Finally, if we have a huge dataset and only a few rows have values missing in some columns, then the easiest and fastest way is to drop those rows. Since the dataset is large, dropping a few rows should not be a problem.
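A minimal pandas sketch of these strategies (the DataFrame and column names are hypothetical):

import numpy as np
import pandas as pd

# Hypothetical DataFrame with missing values
df = pd.DataFrame({'age': [25, np.nan, 40, 35], 'city': ['A', 'B', None, 'B']})

df['age'] = df['age'].fillna(df['age'].mean())        # fill a numeric column with its mean
df['city'] = df['city'].fillna(df['city'].mode()[0])  # fill a categorical column with its most frequent value
# df = df.dropna()                                    # or simply drop the rows that still have missing values
print(df)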

53. What is the benefit of dimensionality reduction?


Dimensionality reduction reduces the dimensions and size of the entire dataset. It drops unnecessary
features while retaining the overall information in the data intact. Reduction in dimensions leads to
faster processing of the data.

The reason why data with high dimensions is considered so difficult to deal with is that it leads to high
time consumption while processing the data and training a model on it. Reducing dimensions speeds
up this process, removes noise, and also leads to better model accuracy.

54. What is a bias-variance trade-off in Data Science?


When building a model using Data Science or Machine Learning, our goal is to build one that has low
bias and variance. We know that bias and variance are both errors that occur due to either an overly
simplistic model or an overly complicated model. Therefore, when we are building a model, the goal of
getting high accuracy is only going to be accomplished if we are aware of the tradeoff between bias and
variance.

Bias is an error that occurs when a model is too simple to capture the patterns in a dataset. To reduce
bias, we need to make our model more complex. Although making the model more complex can lead to
reducing bias, if we make the model too complex, it may end up becoming too rigid, leading to high
variance. So, the tradeoff between bias and variance is that if we increase the complexity, the bias
reduces and the variance increases, and if we reduce complexity, the bias increases and the variance
reduces. Our goal is to find a point at which our model is complex enough to give low bias but not so
complex to end up having high variance.

55. What is RMSE?


RMSE stands for the root mean square error. It is a measure of accuracy in regression. RMSE allows us
to calculate the magnitude of error produced by a regression model. The way RMSE is calculated is as
follows:

First, we calculate the errors in the predictions made by the regression model. For this, we calculate the
differences between the actual and the predicted values. Then, we square the errors.
After this step, we calculate the mean of the squared errors, and finally, we take the square root of the
mean of these squared errors. This number is the RMSE and a model with a lower value of RMSE is
considered to produce lower errors, i.e., the model will be more accurate.
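A short NumPy sketch of this calculation (with made-up actual and predicted values):

import numpy as np

actual = np.array([10, 12, 14, 16])
predicted = np.array([11, 11, 15, 15])

errors = actual - predicted           # step 1: errors in the predictions
rmse = np.sqrt(np.mean(errors ** 2))  # steps 2-3: square, take the mean, then the square root
print(rmse)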

56. What is a kernel function in SVM?


In the SVM algorithm, a kernel function is a special mathematical function. In simple terms, a kernel
function takes data as input and converts it into a required form. This transformation of the data is
based on something called a kernel trick, which is what gives the kernel function its name. Using the
kernel function, we can transform the data that is not linearly separable (cannot be separated using a
straight line) into one that is linearly separable.
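As an illustrative sketch (assuming scikit-learn), the RBF kernel lets an SVM separate data that is not linearly separable, such as two concentric circles:

from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Synthetic data that cannot be separated by a straight line
X, y = make_circles(n_samples=200, noise=0.05, factor=0.5, random_state=0)

# The RBF kernel implicitly maps the data into a space where it becomes linearly separable
clf = SVC(kernel='rbf')
clf.fit(X, y)
print(clf.score(X, y))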

57. How can we select an appropriate value of k in k-means?


Selecting the correct value of k is an important aspect of k-means clustering. We can make use of the elbow method to pick the appropriate k value. To do this, we run the k-means algorithm on a range of values, e.g., 1 to 15. For each value of k, we compute an average score. This score is also called inertia, or the within-cluster sum of squares.

It is calculated as the sum of the squared distances of all points in a cluster from that cluster’s center. As k increases from a low value, we first see a sharp decrease in the inertia. After a certain value of k, the drop in the inertia becomes quite small; this “elbow” point is the value of k that we should choose for the k-means clustering algorithm.
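A minimal sketch of the elbow method with scikit-learn (synthetic data; the exact parameters are illustrative):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Record the inertia for each candidate k and look for the "elbow" where the drop flattens out
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)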

58. How can we deal with outliers?


Outliers can be dealt with in several ways. One way is to drop them. We can only drop the outliers if they
have values that are incorrect or extreme. For example, if a dataset with the weights of babies has a
value 98.6-degree Fahrenheit, then it is incorrect. Now, if the value is 187 kg, then it is an extreme value,
which is not useful for our model.

In case the outliers are not that extreme, then we can try:

A different kind of model. For example, if we were using a linear model, then we can choose a non-
linear model
Normalizing the data, which will shift the extreme values closer to other data points
Using algorithms that are not so affected by outliers, such as random forest, etc.

59. How to calculate the accuracy of a binary classification algorithm using its confusion matrix?
In a binary classification algorithm, we have only two labels, which are True and False. Before we can
calculate the accuracy, we need to understand a few key terms:

True positives: Number of observations correctly classified as True


True negatives: Number of observations correctly classified as False
False positives: Number of observations incorrectly classified as True
False negatives: Number of observations incorrectly classified as False

To calculate the accuracy, we divide the number of correctly classified observations (true positives plus true negatives) by the total number of observations:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
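For example, with hypothetical counts taken from a confusion matrix:

tp, tn, fp, fn = 50, 35, 10, 5  # hypothetical counts from a confusion matrix

accuracy = (tp + tn) / (tp + tn + fp + fn)  # correctly classified / total observations
print(accuracy)                             # 0.85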

60. What is ensemble learning?


When we are building models using Data Science and Machine Learning, our goal is to get a model that
can understand the underlying trends in the training data and can make predictions or classifications
with a high level of accuracy.

However, sometimes some datasets are very complex, and it is difficult for one model to be able to
grasp the underlying trends in these datasets. In such situations, we combine several individual models
together to improve performance. This is what is called ensemble learning.

61. Explain collaborative filtering in recommender systems.


Collaborative filtering is a technique used to build recommender systems. In this technique, to generate
recommendations, we make use of data about the likes and dislikes of users similar to other users. This
similarity is estimated based on several varying factors, such as age, gender, locality, etc.
If User A, similar to User B, watched and liked a movie, then that movie will be recommended to User B,
and similarly, if User B watched and liked a movie, then that would be recommended to User A.

In other words, the content of the movie does not matter much. When recommending it to a user what
matters is if other users similar to that particular user liked the content of the movie or not.

62. Explain content-based filtering in recommender systems.


Content-based filtering is one of the techniques used to build recommender systems. In this technique,
recommendations are generated by making use of the properties of the content that a user is
interested in.

For example, if a user is watching movies belonging to the action and mystery genre and giving them
good ratings, it is a clear indication that the user likes movies of this kind. If shown movies of a similar
genre as recommendations, there is a higher probability that the user would like those
recommendations as well.

In other words, here, the content of the movie is taken into consideration when generating
recommendations for users.

63. Explain bagging in Data Science.


Bagging is an ensemble learning method. It stands for bootstrap aggregating. In this technique, we generate data using the bootstrap method, in which we draw multiple samples of size N from an already existing dataset. This bootstrapped data is then used to train multiple models in parallel, which makes the bagging model more robust than a simple model.

Once all the models are trained and it is time to make a prediction, we make predictions using all the trained models and then average the results in the case of regression; for classification, we choose the result generated by the models that has the highest frequency (a majority vote).
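A minimal scikit-learn sketch of bagging (illustrative; by default, BaggingClassifier uses decision trees as the weak learners):

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier

X, y = load_iris(return_X_y=True)

# 50 models, each trained on a bootstrap sample; predictions are aggregated by voting
bag = BaggingClassifier(n_estimators=50, random_state=0)
bag.fit(X, y)
print(bag.score(X, y))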

64. Explain boosting in data science.


Boosting is one of the ensemble learning methods. Unlike bagging, it is not a technique for training our models in parallel. In boosting, we create multiple models and train them sequentially, combining weak models iteratively in such a way that training a new model depends on the models trained before it.

In doing so, we take the patterns learned by a previous model and test them on a dataset when training
the new model. In each iteration, we give more importance to observations in the dataset that are
incorrectly handled or predicted by previous models. Boosting is useful in reducing bias in models as
well.

65. Explain stacking in data science.


Just like bagging and boosting, stacking is also an ensemble learning method. In bagging and boosting,
we could only combine weak models that used the same learning algorithms, e.g., logistic regression.
These models are called homogeneous learners.

However, in stacking, we can combine weak models that use different learning algorithms as well. These
learners are called heterogeneous learners. Stacking works by training multiple (and different) weak
models or learners and then using them together by training another model, called a meta-model, to
make predictions based on the multiple outputs of predictions returned by these multiple weak
models.

66. Explain how machine learning is different from deep learning.


A field of computer science, machine learning is a subfield of data science that deals with using existing
data to help systems automatically learn new skills to perform different tasks without having rules to be
explicitly programmed.

Deep Learning, on the other hand, is a field in machine learning that deals with building machine
learning models using algorithms that try to imitate the process of how the human brain learns from
the information in a system for it to attain new capabilities. In deep learning, we make heavy use of
deeply connected neural networks with many layers.

67. What does the word ‘Naive’ mean in Naive Bayes?


Naive Bayes is a data science algorithm. It has the word ‘Bayes’ in it because it is based on the Bayes
theorem, which deals with the probability of an event occurring given that another event has already
occurred.

It has ‘naive’ in it because it makes the assumption that each variable in the dataset is independent of
the other. This kind of assumption is unrealistic for real-world data. However, even with this
assumption, it is very useful for solving a range of complicated problems, e.g., spam email classification,
etc.

To learn more about Data Science, check out our Data Science Course in Hyderabad.

68. What is batch normalization?


Batch normalization is a method for improving the performance and stability of a neural network. It normalizes the inputs of each layer so that the mean activation stays close to 0 and the standard deviation stays close to 1.
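A minimal Keras sketch (assuming TensorFlow is available; the layer sizes and input shape are hypothetical):

import tensorflow as tf

# BatchNormalization normalizes the activations of the previous layer for each training batch
model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy')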

69. What do you understand from cluster sampling and systematic sampling?
Cluster sampling is a probability sampling approach in which the population is divided into groups, such as districts or schools, and a random sample of these groups is then selected. Each cluster should be a small-scale representation of the population as a whole.

Systematic sampling is a probability sampling strategy in which individuals are picked from the population at regular intervals, such as every 15th person on a population list. The population can be ordered randomly to mimic the benefits of simple random sampling.

70. What is the Computational Graph?


A computational graph is a directed graph whose nodes are variables or operations. Variables can feed their values into operations, and operations can feed their outputs into other operations. In this manner, each node in the graph defines a function of the variables.

71. What is the difference between Batch and Stochastic Gradient Descent?
The differences between Batch and Stochastic Gradient Descent are as follows:

Batch Gradient Descent vs. Stochastic Gradient Descent:

Gradient computation: Batch gradient descent calculates the gradient using the entire dataset, whereas stochastic gradient descent calculates the gradient using only a single sample.

Convergence: Batch gradient descent takes more time to converge, whereas stochastic gradient descent takes less time to converge.

Data volume per update: Batch gradient descent processes a large volume of data for each update, whereas stochastic gradient descent processes a much smaller volume.

Weight updates: Batch gradient descent updates the weights infrequently, whereas stochastic gradient descent updates the weights more frequently.
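The contrast can be seen in a small NumPy sketch that fits y = w * x by minimizing the mean squared error (the data and learning rate are made up):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])
lr = 0.01

# Batch gradient descent: one weight update per pass, using the gradient over ALL samples
w = 0.0
for _ in range(100):
    grad = -2 * np.mean(x * (y - w * x))
    w -= lr * grad

# Stochastic gradient descent: one weight update per sample, so the weight changes far more often
w_sgd = 0.0
for _ in range(100):
    for xi, yi in zip(x, y):
        grad = -2 * xi * (yi - w_sgd * xi)
        w_sgd -= lr * grad

print(w, w_sgd)  # both approach roughly 2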

72. What is an activation function?


An activation function is a function incorporated into an artificial neural network to help the network learn complex patterns in the input data. Analogous to a neuron in the human brain, the activation function decides, at the end of a unit’s computation, what signal should be passed on to the next neuron.

73. How Do You Build a random forest model?


The steps for creating a random forest model are as follows:

Randomly choose n records from a dataset containing k records.
Build a separate decision tree for each of the n samples under consideration; a predicted result is obtained from each of them.
Each of the predictions is subjected to a voting mechanism.
The final outcome is the prediction that receives the most votes, as sketched below.
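A minimal scikit-learn sketch of these steps (illustrative, using the built-in iris dataset):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 decision trees, each trained on a bootstrap sample; the final prediction is a majority vote
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))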
74. Can you avoid overfitting your model? If yes, then how?
Yes, in practice data models may overfit, and the strategies listed below can be applied to avoid it:

Increase the amount of data in the dataset under study to make it simpler to separate the links
between the input and output variables.
To discover important traits or parameters that need to be examined, use feature selection.
Use regularization strategies to lessen the variation of the outcomes a data model generates.
Sometimes, datasets are stabilized by adding a small amount of noisy data. This practice is called data augmentation.

75. What is Cross Validation?


Cross-validation is a model validation method used to assess the generalizability of statistical analysis
results to other data sets. It is frequently applied when forecasting is the main objective and one wants
to gauge how well a model will work in real-world applications.

In order to prevent overfitting and gather knowledge on how the model will generalize to different data
sets, cross-validation aims to establish a data set to test the model during the training phase (i.e.
validation data set).

76. What is variance in Data Science?


Variance is a type of error that occurs in a Data Science model when the model ends up being too
complex and learns features from data, along with the noise that exists in it. This kind of error can
occur if the algorithm used to train the model has high complexity, even though the data and the
underlying patterns and trends are quite easy to discover. This makes the model a very sensitive one
that performs well on the training dataset but poorly on the testing dataset, and on any kind of data
that the model has not yet seen. Variance generally leads to poor accuracy in testing and results in
overfitting.

77. What is pruning in a decision tree algorithm?


Pruning a decision tree is the process of removing the sections of the tree that are not necessary or are
redundant. Pruning leads to a smaller decision tree, which performs better and gives higher accuracy
and speed.

78. What is entropy in a decision tree algorithm?


In a decision tree algorithm, entropy is the measure of impurity or randomness. The entropy of a given
dataset tells us how pure or impure the values of the dataset are. In simple terms, it tells us about the
variance in the dataset.

Entropy(D) = - p * log2(p) - (1 - p) * log2(1 - p)


where:
Entropy(D) represents the entropy of the dataset D
p represents the proportion of positive class instances in D
log2 represents the logarithm to the base 2.

For example, suppose we are given a box with 10 blue marbles. Then, the entropy of the box is 0, as it contains marbles of the same color, i.e., there is no impurity. If we need to draw a marble from the box, the probability of it being blue will be 1.0. However, if we replace 4 of the blue marbles with 4 red marbles, the probability of drawing a blue marble drops to 0.6 and the entropy of the box rises to about 0.97.

Additionally, In a decision tree algorithm, multi-class entropy is a measure used to evaluate the impurity
or disorder of a dataset with respect to the class labels when there are multiple classes involved. It is
commonly used as a criterion to make decisions about splitting nodes in a decision tree.
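The marble example can be verified with a few lines of Python (a small sketch of the binary entropy formula above):

import math

def entropy(p):
    # Binary entropy: -p*log2(p) - (1-p)*log2(1-p); defined as 0 when p is 0 or 1
    if p in (0, 1):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(entropy(1.0))  # box with only blue marbles -> 0 (completely pure)
print(entropy(0.6))  # 6 blue and 4 red marbles -> about 0.97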

79. What is information gain in a decision tree algorithm?


When building a decision tree, at each step, we have to create a node that decides which feature we
should use to split data, i.e., which feature would best separate our data so that we can make
predictions. This decision is made using information gain, which is a measure of how much entropy is
reduced when a particular feature is used to split the data. The feature that gives the highest
information gain is the one that is chosen to split the data.

Let’s consider a practical example to gain a better understanding of how information gain operates
within a decision tree algorithm. Imagine we have a dataset containing customer information such as
age, income, and purchase history. Our objective is to predict whether a customer will make a purchase
or not.

To determine which attribute provides the most valuable information, we calculate the information gain
for each attribute. If splitting the data based on income leads to subsets with significantly reduced
entropy, it indicates that income plays a crucial role in predicting purchase behavior. Consequently,
income becomes a crucial factor in constructing the decision tree as it offers valuable insights.

By maximizing information gain, the decision tree algorithm identifies attributes that effectively reduce
uncertainty and enable accurate splits. This process enhances the model’s predictive accuracy,
enabling informed decisions pertaining to customer purchases.

Explore this Data Science Course in Delhi and master decision tree algorithm.

Advanced Data Science Interview Questions

80. From the below given ‘diamonds’ dataset, extract only those rows
where the ‘price’ value is greater than 1000 and the ‘cut’ is ideal.

First, we will load the ggplot2 package:

library(ggplot2)

Next, we will use the dplyr package:

library(dplyr)  # It is based on the grammar of data manipulation.

To extract those particular records, use the below command:

diamonds %>% filter(price > 1000 & cut == "Ideal") -> diamonds_1000_ideal

81. Make a scatter plot between ‘price’ and ‘carat’ using ggplot. ‘Price’
should be on the y-axis, ’carat’ should be on the x-axis, and the ‘color’ of
the points should be determined by ‘cut.’
We will implement the scatter plot using ggplot.

The ggplot is based on the grammar of data visualization, and it helps us stack multiple layers on top of
each other.

So, we will start with the data layer, and on top of the data layer we will stack the aesthetic layer. Finally,
on top of the aesthetic layer we will stack the geometry layer.

Code:

ggplot(data = diamonds, aes(x = carat, y = price, col = cut)) + geom_point()

82. Introduce 25 percent missing values in this ‘iris’ dataset and impute
the ‘Sepal.Length’ column with ‘mean’ and the ‘Petal.Length’ column with
‘median.’
To introduce missing values, we will be using the missForest package:

library(missForest)

Using the prodNA function, we will be introducing 25 percent of missing values:

iris.mis <- prodNA(iris, noNA = 0.25)

For imputing the ‘Sepal.Length’ column with ‘mean’ and the ‘Petal.Length’ column with ‘median,’ we will
be using the Hmisc package and the impute function:

library(Hmisc)
iris.mis$Sepal.Length<-with(iris.mis, impute(Sepal.Length,mean))
iris.mis$Petal.Length<-with(iris.mis, impute(Petal.Length,median))

83. Implement simple linear regression in R on this ‘mtcars’ dataset, where the dependent variable is ‘mpg’ and the independent variable is ‘disp.’

Here, we need to find how ‘mpg’ varies with respect to ‘disp’ (the displacement column).

We need to divide this data into the training dataset and the testing dataset so that the model does not
overfit the data.

So, what happens is when we do not divide the dataset into these two components, it overfits the
dataset. Hence, when we add new data, it fails miserably on that new data.

Therefore, to divide this dataset, we would require the caret package. This caret package comprises the createDataPartition() function. This function will give the true or false labels.

Here, we will use the following code:

library(caret)

split_tag<-createDataPartition(mtcars$mpg, p=0.65, list=F)

mtcars[split_tag,]->train

mtcars[-split_tag,]->test

lm(mpg ~ disp, data = train) -> mod_mtcars

predict(mod_mtcars,newdata=test)->pred_mtcars

head(pred_mtcars)

Explanation:
Parameters of the createDataPartition function: First is the column which determines the split (it is the
mpg column).

Second is the split ratio which is 0.65, i.e., 65 percent of records will have true labels and 35 percent will
have false labels. We will store this in a split_tag object.

Once we have the split_tag object ready, from this entire mtcars dataframe, we will select all those
records where the split tag value is true and store those records in the training set.

Similarly, from the mtcars dataframe, we will select all those record where the split_tag value is false
and store those records in the test set.

So, the split tag will have true values in it, and when we put ‘-’ symbol in front of it, ‘-split_tag’ will contain
all of the false labels. We will select all those records and store them in the test set.

We will go ahead and build a model on top of the training set, and for the simple linear model we will
require the lm function.

lm(mpg ~ disp, data = train) -> mod_mtcars

Now, we have built the model on top of the train set. It’s time to predict the values on top of the test set.
For that, we will use the predict function that takes in two parameters: first is the model which we have
built and the second is the dataframe on which we have to predict values.

Thus, we have to predict values for the test set and then store them in pred_mtcars.

predict(mod_mtcars,newdata=test)->pred_mtcars

Output:

These are the predicted values of mpg for all of these cars.

So, this is how we can build a simple linear model on top of this mtcars dataset.

84. Calculate the RMSE values for the model building.


When we build a regression model, it predicts certain y values associated with the given x values, but
there is always an error associated with this prediction. So, to get an estimate of the average error in
prediction, RMSE is used.

Code:

cbind(Actual=test$mpg, predicted=pred_mtcars)->final_data

as.data.frame(final_data)->final_data

error <- final_data$Actual - final_data$predicted

cbind(final_data,error)->final_data

sqrt(mean((final_data$error)^2))

Explanation: We have the actual and the predicted values. We will bind both of them into a single
dataframe. For that, we will use the cbind function:

cbind(Actual=test$mpg, predicted=pred_mtcars)->final_data

Our actual values are present in the mpg column of the test set, and our predicted values are stored in the pred_mtcars object, which we created in the previous question. Hence, we create the final_data object, convert it into a data frame, compute the error column as the difference between the actual and the predicted values, and finally take the square root of the mean of the squared errors to obtain the RMSE.