Top 55 Data Analytics Interview Questions and Answers
“Data is a precious thing and will last longer than the systems themselves.” – Tim Berners-Lee,
inventor of the World Wide Web.
We live in an information-driven age where data plays an integral role in the functioning of any
organization. Thus, organizations are always on the lookout for skilled data analysts who can
turn their data into valuable information. This can help organizations achieve better business
growth as they can better understand the market, consumers, their products or services, and
much more.
Technology professionals with data analytics skills are finding themselves in high demand as
businesses look to harness data power. If you are planning to be a part of this high potential
industry and prepare for your next data analytics interview, you are in the right place!
Here are the top 55 data analytics questions & answers that will help you clear your next data
analytics interview. These questions cover all the essential topics, ranging from data cleaning
and data validation to SAS.
Let’s begin!
Here are some of the top data analytics interview questions & answers:
Q1. What are the best ways of cleaning data?
Ans. Some of the best ways to clean data are:
Make a data cleaning plan by understanding where the common errors take place and
keep communications open.
Standardize the data at the point of entry. This way it is less chaotic and you will be able
to ensure that all information is standardized, leading to fewer errors on entry.
Focus on the accuracy of the data. Maintain the value types of data, provide mandatory
constraints, and set cross-field validation.
Identify and remove duplicates before working with the data. This will lead to an
effective data analysis process.
Create a set of utility tools/functions/scripts to handle common data cleaning tasks (a minimal code sketch follows this list).
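For illustration, here is a hedged sketch of such utility functions using pandas; the column names and cleaning rules are assumed for the example, not taken from any particular dataset.

import pandas as pd

def basic_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Apply a few common data cleaning steps to a DataFrame."""
    df = df.copy()
    # Standardize text columns: trim whitespace and lower-case the values.
    for col in df.select_dtypes(include="object").columns:
        df[col] = df[col].str.strip().str.lower()
    # Remove exact duplicate rows before analysis.
    df = df.drop_duplicates()
    return df

# Hypothetical example usage with made-up customer data.
raw = pd.DataFrame({
    "name": [" Alice ", "BOB", "BOB"],
    "age": [34, 29, 29],
})
print(basic_clean(raw))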
Q2. What are the challenges that are faced as a data analyst?
Ans. There are various ways you can answer this question. Common challenges include very
badly formatted data, data that is not enough to work with, data that clients have supposedly
cleaned but have actually made worse, not getting updated data, and factual or data entry
errors.
Q3. What are the data validation methods used in data analytics?
Ans. The data validation methods used in data analytics are:
Field Level Validation – Validation is done in each field as the user enters the data, to
avoid errors caused by human interaction.
Form Level Validation – Validation is done once the user completes the form, before the
information is saved.
Data Saving Validation – This type of validation is performed while the actual file or
database record is being saved. It is usually done when there are multiple data entry forms.
Search Criteria Validation – This type of validation ensures that the search criteria
entered by the user return relevant results (a simple code sketch follows this list).
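To make the idea concrete, here is a hedged sketch of field-level and cross-field checks in Python; the field names (email, age, start_date, end_date) and the rules are illustrative assumptions, not from a specific system.

import re

def validate_record(record: dict) -> list:
    """Return a list of validation errors for a single data-entry record."""
    errors = []
    # Field-level validation: check each field as it is entered.
    if not re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", record.get("email", "")):
        errors.append("email is not a valid address")
    if not isinstance(record.get("age"), int) or record["age"] < 0:
        errors.append("age must be a non-negative integer")
    # Cross-field validation: fields must be consistent with each other.
    if record.get("start_date") and record.get("end_date"):
        if record["end_date"] < record["start_date"]:
            errors.append("end_date cannot be earlier than start_date")
    return errors

# Hypothetical form submission with an inconsistent date range.
print(validate_record({"email": "user@example.com", "age": 30,
                       "start_date": "2023-01-01", "end_date": "2022-12-31"}))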
Q4. What is an outlier?
Ans. Any observation that lies at an abnormal distance from the other observations is known as an
outlier. It indicates either variability in the measurement or an experimental error.
Q5. What is the difference between data mining and data profiling?
Ans. Data profiling is usually done to assess a dataset for its uniqueness, consistency, and logic.
It cannot identify incorrect or inaccurate data values.
Data mining is the process of finding relevant information that has not been found before. It is
the way in which raw data is turned into valuable information.
Ans. A good data analyst would be able to understand the market dynamics and act accordingly
to retain a working data model so as to adjust to the new environment.
Ans. KNN (K-nearest neighbor) is an algorithm that is used for matching a point with its closest
k neighbors in a multi-dimensional space.
Ans. KNN is used for missing values under the assumption that a point value can be
approximated by the values of the points that are closest to it, based on other variables.
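For illustration, here is a small sketch of KNN-based imputation using scikit-learn's KNNImputer; the tiny matrix and the choice of 2 neighbours are made up for the example.

import numpy as np
from sklearn.impute import KNNImputer

# Made-up data with a missing value (np.nan) in the second row.
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, 6.0],
              [8.0, 8.0]])

# Each missing entry is replaced by the mean of its 2 nearest neighbours,
# where distance is computed on the observed (non-missing) features.
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))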
Ans. The K-means algorithm partitions a data set into clusters such that each cluster formed is
homogeneous and the points within a cluster are close to each other. The algorithm tries to
maintain enough separation between the clusters. Because it is unsupervised, the clusters
have no labels.
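A minimal scikit-learn sketch of K-means on a few made-up 2-D points, assuming scikit-learn is available:

import numpy as np
from sklearn.cluster import KMeans

# Two well-separated blobs of made-up points.
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# Partition the points into k=2 clusters; each point gets an unlabeled cluster id.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)            # e.g. [1 1 1 0 0 0]
print(kmeans.cluster_centers_)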
Q11. What is the difference between the true positive rate and recall?
Ans. There is no difference; they are the same, with the formula:
True Positive Rate = Recall = True Positives / (True Positives + False Negatives)
Q12. What is the difference between linear regression and logistic regression?
Ans. Linear regression models a continuous dependent variable by fitting a straight line to the
data (typically using least squares), while logistic regression models the probability of a
categorical, usually binary, outcome by fitting an S-shaped sigmoid curve using maximum
likelihood estimation.
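To make the contrast concrete, here is a small scikit-learn sketch on made-up data: the first model predicts a continuous value, the second predicts class probabilities.

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.array([[1], [2], [3], [4], [5]])

# Linear regression: continuous target (e.g. a price).
y_cont = np.array([1.2, 1.9, 3.1, 4.2, 4.8])
lin = LinearRegression().fit(X, y_cont)
print(lin.predict([[6]]))          # a continuous estimate

# Logistic regression: binary target (e.g. churned / not churned).
y_bin = np.array([0, 0, 0, 1, 1])
log = LogisticRegression().fit(X, y_bin)
print(log.predict_proba([[6]]))    # class probabilities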
Q13. What are the characteristics of a good data model?
Ans. A good data model has the following characteristics:
It is intuitive.
Its data can be easily consumed.
The data changes in it are scalable.
It can evolve and support new business cases.
Q14. Estimate the number of weddings that take place in a year in India?
Ans. To answer this type of guesstimation question, one should always follow four steps:
Step 1:
Start with the right proxy – here the right proxy will be the total population. You know that
India has a population of more than 1 billion and, to be a bit more precise, it's around 1.2 billion.
Step 2:
Segment and filter – the next step is to find the right segments and filter out the ones that are
not relevant. You will have a tree-like structure, with branches for each segment and sub-branches
that filter each segment further. In this question, we will filter out the population above
35 years of age, and below 15 years for rural areas / below 20 years for urban areas.
Step 3:
Always round off the proxy to one or zero decimal points so that your calculation is easy. Instead
of doing a calculation like 1488/5, you can go for 1500/5.
Step 4:
Validate each number using your common sense to understand if it's the right one. Add all the
numbers that you have come up with after filtering, and you will get the required guesstimate.
E.g., we will validate the guesstimate to include first-time marriages only at the end.
Let's do it:
Assume that around 65% of the population is below 35 years of age and that roughly 70% of
India's population is rural.
Percentage of the rural population in the marriageable age band (15 to 35 years) ≈
(35-15)/35 * 65% ≈ 40%
Percentage of the urban population in the marriageable age band (20 to 35 years) ≈
(35-20)/35 * 65% ≈ 30%
Rural couples of marriageable age ≈ (1.2 billion * 70% * 40%) / 2 ≈ 170 million, so first-time
marriages in the rural area ≈ 170 million/20 ≈ 8.5 million per year
Urban couples of marriageable age ≈ (1.2 billion * 30% * 30%) / 2 ≈ 55 million, so first-time
marriages in the urban area ≈ 55 million/15 ≈ 3.5 million per year
Adding the two, the guesstimate comes to roughly 8.5 + 3.5 ≈ 12 million weddings in a year in India.
Q15. When do you use a t-test and when do you use a z-test?
Ans. The t-test is usually used when we have a sample size of less than 30, and a z-test when we
have a sample size greater than 30.
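For example, assuming SciPy is available, an independent two-sample t-test on two small made-up samples looks like this:

from scipy import stats

# Two small made-up samples (n < 30 each), e.g. load times of two site versions.
sample_a = [12.1, 11.8, 13.0, 12.6, 11.5, 12.9, 12.2]
sample_b = [13.4, 13.1, 12.8, 13.9, 13.5, 12.7, 13.2]

# Independent two-sample t-test; a small p-value suggests the means differ.
t_stat, p_value = stats.ttest_ind(sample_a, sample_b)
print(t_stat, p_value)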
Q16. What are the two main methods to detect outliers?
Ans. Box plot method: if the value is higher or lesser than 1.5*IQR (interquartile range) above
the upper quartile (Q3) or below the lower quartile (Q1) respectively, then it is considered an
outlier.
Standard deviation method: if the value is higher or lower than mean ± (3*standard deviation),
then it is considered an outlier.
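Both rules can be expressed in a few lines of NumPy; the sample values below are made up:

import numpy as np

data = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 102, 12, 14, 17])

# Box plot (IQR) method: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
iqr_outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]

# Standard deviation method: flag values outside mean ± 3 * standard deviation.
mean, std = data.mean(), data.std()
sd_outliers = data[(data < mean - 3 * std) | (data > mean + 3 * std)]

print(iqr_outliers, sd_outliers)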
Ans. The standardized coefficient is interpreted in terms of standard deviation while the
unstandardized coefficient is measured in actual values.
Ans. R-squared measures the proportion of variation in the dependent variable that is explained by
the independent variables.
Adjusted R-squared gives the percentage of variation explained by only those independent variables
that actually affect the dependent variable, and it penalizes the addition of predictors that do
not improve the model.
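For reference, one common form of the adjusted R-squared is: Adjusted R-squared = 1 - (1 - R-squared) * (n - 1) / (n - k - 1), where n is the number of observations and k is the number of independent variables.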
Q20. What is the difference between factor analysis and principal component analysis?
Ans. The aim of principal component analysis is to explain as much of the total variance in the
variables as possible, while the aim of factor analysis is to explain the covariances or
correlations between the variables.
Ans. Since data preparation is a critical step in data analytics, the interviewer might be
interested in knowing the approach you will take to clean and transform raw data before
processing and analysis. As an answer to this data analytics interview question, you should
discuss the model you will be using, along with logical reasoning for it. In addition, you should
also discuss how your steps would help to ensure superior scalability and accelerated data
usage.
Q23. What are some of the most popular tools used in data analytics?
Ans. Some of the most popular tools used in data analytics are:
Tableau
Google Fusion Tables
Google Search Operators
Konstanz Information Miner (KNIME)
RapidMiner
Solver
OpenRefine
NodeXL
Io
Pentaho
SQL Server Reporting Services (SSRS)
Microsoft data management stack
Q24. What are the most popular statistical methods used when analyzing data?
Ans. The most popular statistical methods used in data analytics are –
Linear Regression
Classification
Resampling Methods
Subset Selection
Shrinkage
Dimension Reduction
Nonlinear Models
Tree-Based Methods
Support Vector Machines
Unsupervised Learning
Q27. Do you have any idea about the job profile of a data analyst?
Ans. Yes, I have a fair idea of the job responsibilities of a data analyst. Their primary
responsibilities are –
To work in collaboration with IT, management, and/or data scientist teams to determine
organizational goals
Extract data from primary and secondary sources
Clean the data and discard irrelevant information
Perform data analysis and interpret results using standard statistical methodologies
Highlight changing trends, correlations, and patterns in complicated data sets
Strategize process improvement
Ensure clear data visualizations for management
Ans. A Pivot Table is a Microsoft Excel feature used to summarize huge datasets quickly. It
sorts, reorganizes, counts, or groups data stored in a database. This data summarization
includes sums, averages, or other statistics.
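Although the question refers to Excel, an analogous summary can be sketched in Python with pandas.pivot_table; the sales data below is made up:

import pandas as pd

# Made-up sales records.
sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "South"],
    "product": ["A", "B", "A", "A", "B"],
    "revenue": [100, 150, 200, 120, 80],
})

# Summarize revenue by region and product, similar to an Excel PivotTable.
pivot = pd.pivot_table(sales, values="revenue", index="region",
                       columns="product", aggfunc="sum", fill_value=0)
print(pivot)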
Ans. Standard deviation is a very popular method to measure any degree of variation in a data
set. It measures the average spread of data around the mean most accurately.
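As a quick reference, the standard deviation is the square root of the average squared deviation from the mean: for values x1, ..., xn with mean m, the population standard deviation is sqrt(((x1 - m)^2 + ... + (xn - m)^2) / n), while the sample version divides by n - 1.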
Ans. A data collection plan is used to collect all the critical data in a system. It covers the type of data to be collected, the sources of the data, and the methods of collection.
Ans. An Affinity Diagram is an analytical tool used to cluster or organize data into subgroups
based on their relationships. These data or ideas are mostly generated from discussions or
brainstorming sessions, and are used in analyzing complex issues.
Ans. Missing data may lead to some critical issues; hence, imputation is the methodology that
can help to avoid pitfalls. It is the process of replacing missing data with substituted values.
Imputation helps in preventing list-wise deletion of cases with missing values.
Q34. Name some of the essential tools useful for Big Data analytics.
NodeXL
KNIME
Tableau
Solver
OpenRefine
Rattle GUI
Qlikview
Ans. A Truth Table is a collection of facts that determines the truth or falsity of a proposition. It
works as a complete theorem-prover and is of three types –
Ans. In simpler terms, data visualization is the graphical representation of information and data. It
enables users to view and analyze data in a smarter way and to present it in diagrams and
charts.
Ans. Since it is easier to view and understand complex data in the form of charts or graphs, the
trend of data visualization has picked up rapidly.
Ans. Metadata refers to the detailed information about the data system and its contents. It
helps to define the type of data or information that will be sorted.
Ans. Overfitting – In overfitting, a statistical model describes random error or noise instead of
the underlying relationship, which occurs when a model is overly complicated. An overfit model has
poor predictive performance as it overreacts to minor fluctuations in the training data.
Underfitting – In underfitting, a statistical model is unable to capture the underlying data trend.
This type of model also shows poor predictive performance.
Ans. The Python libraries commonly used in data analysis and visualization include:
Bokeh
Matplotlib
NumPy
Pandas
SciKit
SciPy
Seaborn
TensorFlow
Keras
Ans. There are two steps involved in the data validation process, namely, data screening and
data verification.
Data Screening: In this step, different algorithms are used to screen the entire data to find any
inaccurate values. It is the process of ensuring that the data is clean and ready for analysis.
Data Verification: In the data verification step, the accuracy and quality of source data are
checked before using it. Every suspected value is evaluated on various use-cases, and then a
final decision is taken on whether the value has to be included in the data or not. Data
validation is a form of data cleansing.
Q42. Mention some problems that data analysts face while performing the analysis?
Ans. The problems that data analysts face while performing data analysis are:
Q44. Explain KPI, the design of experiments, and the 80/20 rule.
Ans. KPI stands for Key Performance Indicator. It is a metric used to measure the performance of a
business process and can consist of any combination of spreadsheets, reports, or charts about that process.
Also known as experimental design, the design of experiments is the initial process that is used
before data is collected. It is used to split the data, sample, and set up a data set for statistical
analysis.
The 80/20 rule means that 80 percent of your income (or results) comes from 20 percent of
your clients (or efforts).
Ans. The Hadoop Ecosystem is a framework developed by Apache that processes large datasets for
applications in a distributed computing environment. It consists of the following Hadoop
components:
HDFS
YARN
MapReduce
Spark
PIG
HIVE
HBase
Oozie
Mahout
Spark MLlib
Apache Drill
Zookeeper
Flume
Sqoop
Ambari
Solr
Lucene
Q46. What is MapReduce?
Ans. MapReduce is a framework that enables you to write applications to process large data
sets, splitting them into subsets, processing each subset on a different server, and then
blending results obtained on each. It consists of two tasks, namely Map and Reduce. The map
performs filtering and sorting while reduce performs a summary operation. As the name
suggests, the Reduce process always takes place after the map task.
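As an illustration of the two phases, here is a tiny single-machine Python sketch that mimics the classic word-count job; real MapReduce runs the same logic distributed across servers, and the documents below are made up.

from collections import defaultdict

documents = ["big data is big", "data drives decisions"]

# Map phase: emit an intermediate (word, 1) pair for every word in every document.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle/sort: group the intermediate pairs by key (the word).
grouped = defaultdict(list)
for word, count in sorted(mapped):
    grouped[word].append(count)

# Reduce phase: perform a summary operation on each group (here, sum the counts).
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)   # e.g. {'big': 2, 'data': 2, 'is': 1, ...}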
Ans. Clustering is a method of grouping data. Clustering or cluster analysis is the process of
grouping a set of objects in such a manner that the objects in the same cluster are more similar
to each other than to those in other clusters. Clustering algorithms can be:
Hierarchical or flat
Hard and soft
Iterative
Disjunctive
Ans. Time series analysis is a statistical technique that analyzes time-series data to extract
meaningful statistics and other characteristics of the data. There are two ways to do it, namely
the frequency domain and the time domain. Various methods like exponential smoothing
and log-linear regression methods help in forecasting the output of a particular process by
analyzing the previous data.
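As an illustration, simple exponential smoothing can be written in a few lines of plain Python; the sales series below is made up.

def exponential_smoothing(series, alpha=0.5):
    """Simple exponential smoothing: each new level is a weighted blend of
    the latest observation and the previous smoothed level."""
    level = series[0]
    smoothed = [level]
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
        smoothed.append(level)
    return smoothed

# Made-up monthly sales figures.
sales = [120, 132, 128, 140, 150, 147]
smoothed = exponential_smoothing(sales, alpha=0.4)
print(smoothed[-1])   # the last smoothed level serves as the one-step-ahead forecast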
Ans. A Hash Table is a data structure that stores data in an associative manner as a map of keys
to values. It stores the data in an array format where each data value has its own unique index
value. A hash table uses a hash function to compute an index into an array of slots, from
which the desired value can be fetched.
Q50. What are collisions in hash tables? How to deal with them?
Ans. A hash table collision occurs when two different keys are hashed to the same index in a
hash table. In simple terms, it happens when two different keys hash to the same value.
Collisions, thus, create a problem as two elements cannot be stored in the same slot in an
array.
The two main techniques for dealing with collisions are:
1. Separate Chaining – each slot of the table holds a list of all the key-value pairs that hash to that index.
2. Open Addressing – all elements are stored in the table itself, and on a collision the table is probed for the next free slot (for example, by linear probing, quadratic probing, or double hashing).
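A minimal sketch of separate chaining in Python; the class name and the tiny table size are made up for illustration.

class ChainedHashTable:
    """A tiny hash table that resolves collisions by separate chaining."""

    def __init__(self, size=8):
        self.buckets = [[] for _ in range(size)]

    def _index(self, key):
        # Hash the key into one of the available slots.
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # key already present: update in place
                bucket[i] = (key, value)
                return
        bucket.append((key, value))      # colliding keys share the same bucket

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        raise KeyError(key)

table = ChainedHashTable(size=2)         # tiny size to force collisions
table.put("alpha", 1)
table.put("beta", 2)
table.put("gamma", 3)
print(table.get("gamma"))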
Ans. Imputation is the process of replacing the missing data with substituted values. While
there are many ways to approach missing data, the most common imputation techniques are:
Single Imputation: In this, you find a single estimate of the missing value. The following are the
single imputation techniques:
Mean imputation: Replace the missing value with the mean of that variable for all other cases.
Hot deck imputation: Identify all the sample subjects who are similar on other variables, then
randomly choose one of their values on the missing variable.
Cold deck imputation: It works just like hot deck imputation but in a systematic manner – the
missing value is replaced with a systematically chosen value from an individual who has similar
values on other variables.
Regression imputation: The missing value is replaced with the value predicted by regressing the
missing variable on other variables.
Stochastic regression imputation: It works like regression imputation but also adds the average
regression variance to the regression estimate, reintroducing some random variation.
Substitution: Impute the value from a new variable that was not selected to be in the sample.
Multiple Imputation: In the Multiple Imputation technique, the values are estimated multiple
times and the results are combined to account for the uncertainty of the imputation.
Ans. An n-gram is a contiguous sequence of n items from a given sequence of text or speech. It
is the combination of adjacent words or letters of length n that are in the source text. It is a
type of probabilistic language model that predicts the next item in such a sequence in the form
of an (n-1)-order Markov model.
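A short Python sketch that generates word-level n-grams; the example sentence is made up.

def ngrams(text, n=2):
    """Return the list of contiguous word n-grams in the text."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "data analysts turn raw data into insight"
print(ngrams(sentence, n=2))   # bigrams, e.g. ('data', 'analysts'), ('analysts', 'turn'), ...
print(ngrams(sentence, n=3))   # trigrams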
Ans. In SAS, the ANYDIGIT function searches a character string for the first occurrence of any
character that is a digit. If such a character is found, the ANYDIGIT function returns its position
in the string; if there is no such character, it returns a value of 0.
Ans. In SAS, Interleaving means combining individual sorted SAS data sets into one big sorted
data set. Data sets can be interleaved by using a SET statement and a BY statement.
We hope you found this Data Analytics interview questions & answers article useful. The
questions covered in this post are the most sought-after data analytics interview questions that
will help you ace your next interview!