Top 80+ Data Analyst Interview Questions and Answers (2024)
Last Updated : 16 Jun, 2024
In the 21st century, data holds immense value, making data analysis a
lucrative career choice. If you’re considering a career in data analysis but
are worried about interview questions, you’ve come to the right place. This
article presents the top 85 data analyst interview questions and answers
to help you prepare for your interview. Let’s dive into these questions to
equip you for success in the interview process.
Table of Content
Data Analyst Interview Questions for Freshers
Statistics Interview Questions and Answers for Data Analyst
SQL Interview Questions for Data Analysts
Data Visualizations or BI tools Data Analyst Interview questions
What is a Data Analyst?
A data analyst is a person who uses statistical methods, programming,
and visualization tools to analyze and interpret data, helping organizations
make informed decisions. They clean, process, and organize data to
identify trends, patterns, and anomalies, contributing crucial insights that
drive strategic and operational decision-making within businesses and
other sectors.
Data analysts are responsible for collecting, cleaning, and analyzing data
to help businesses make better decisions. They typically use statistical
analysis and visualization tools to identify trends and patterns in data.
Data analysts may also develop reports and dashboards to communicate
their findings to stakeholders.
Data scientists are responsible for creating and implementing machine
learning and statistical models on data. These models are used to make
predictions, automate jobs, and enhance business processes. Data
scientists are also well-versed in programming languages and software
engineering.
Data analysis and business intelligence are closely related fields: both
use data and analysis to make better and more effective
decisions. However, there are some key differences between the two.
4. What are the different tools mainly used for data analysis?
There are many different tools used for data analysis, each with its own
strengths and weaknesses. Some of the most commonly used tools for data
analysis are as follows:
Descriptive and predictive analysis are the two different ways to analyze
the data.
Univariate, bivariate, and multivariate analysis are the three different levels of
data analysis used to understand the data.
8. Name some of the most popular data analysis and visualization tools
used for data analysis.
Some of the most popular data analysis and visualization tools are as
follows:
EDA provides the groundwork for the entire data analysis process. It
enables analysts to make more informed judgments about data
processing, hypothesis testing, modelling, and interpretation, resulting in
more accurate and relevant insights.
Min-Max Scaling: Scales the data to a range between 0 and 1 using the
formula:
(x – min) / (max – min)
Z-Score Normalization (Standardization): Scales data to have a mean
of 0 and a standard deviation of 1 using the formula:
(x – mean) / standard_deviation
Robust Scaling: Scales data by removing the median and scaling to the
interquartile range(IQR) to handle outliers using the formula:
(X – Median) / IQR
Unit Vector Scaling: Scales each data point to have a Euclidean norm
(length) (||X||) of 1 using the formula:
X / ||X||
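These normalization techniques can be sketched in Python with NumPy; the sample array below is hypothetical, and each line applies one of the formulas above:

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 100.0])  # hypothetical data; 100 acts as an outlier

# Min-Max scaling: (x - min) / (max - min), maps values into [0, 1]
minmax = (x - x.min()) / (x.max() - x.min())

# Z-score standardization: (x - mean) / standard_deviation
zscore = (x - x.mean()) / x.std()

# Robust scaling: (x - median) / IQR, less affected by the outlier
q1, q3 = np.percentile(x, [25, 75])
robust = (x - np.median(x)) / (q3 - q1)

# Unit vector scaling: x / ||x||, so the scaled vector has Euclidean norm 1
unit = x / np.linalg.norm(x)
```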
15. What are the main libraries you would use for data analysis in
Python?
For data analysis in Python, many great libraries are used due to their
versatility, functionality, and ease of use. Some of the most common
libraries are as follows:
Structured and unstructured data depend on the format in which the data
is stored. Structured data is information that has been structured in a
certain format, such as a table or spreadsheet. This facilitates searching,
sorting, and analyzing. Unstructured data is information that is not
arranged in a certain format. This makes searching, sorting, and analyzing
more complex.
The key differences between the pandas Series and Dataframes are as
follows:
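The distinction can be seen directly in pandas (the column names and values here are made up for illustration):

```python
import pandas as pd

# A Series is a one-dimensional labeled array: a single column of data
s = pd.Series([25, 30, 35], name="age")

# A DataFrame is a two-dimensional labeled table: multiple columns sharing an index
df = pd.DataFrame({"name": ["Asha", "Ravi", "Meena"], "age": [25, 30, 35]})

# Selecting one column of a DataFrame returns a Series
col = df["age"]
```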
Boxplot
A boxplot is used for detecting outliers in a dataset by visualizing the
distribution of the data.
Range: The range is the difference between the highest and lowest
values in a data set. It gives an idea of how much the data spreads
from the minimum to the maximum.
Variance: The variance is the average of the squared deviations of each
data point from the mean. It is a measure of how spread out the data is
around the mean.
Variance (σ²) = Σ(X − μ)² / N
Standard Deviation: The standard deviation is the square root of the
variance. It is a measure of how spread out the data is around the
mean, but it is expressed in the same units as the data itself.
Mean Absolute Deviation (MAD): MAD is the average of the absolute
differences between each data point and the mean. Unlike variance, it
doesn’t involve squaring the differences, which makes it less sensitive to
outliers than the variance or standard deviation.
Percentiles: Percentiles are statistical values that measure the relative
positions of values within a dataset. They are computed by arranging
the dataset in ascending order from smallest to largest and then
dividing it into 100 equal parts. In other words, a percentile tells you
what percentage of data points are below or equal to a specific value.
Percentiles are often used to understand the distribution of data and to
identify values that are above or below a certain threshold within a
dataset.
Interquartile Range (IQR): The interquartile range (IQR) is the range of
values ranging from the 25th percentile (first quartile) to the 75th
percentile (third quartile). It measures the spread of the middle 50% of
the data and is less affected by outliers.
Coefficient of Variation (CV): The coefficient of variation (CV) is a
measure of relative variability. It is the ratio of the standard deviation to
the mean, expressed as a percentage. It is used to compare the relative
variability between datasets with different units or scales.
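These measures of spread can be computed directly with NumPy; the small sample below is hypothetical:

```python
import numpy as np

data = np.array([4.0, 8.0, 6.0, 5.0, 3.0, 9.0, 7.0, 10.0])  # hypothetical sample

value_range = data.max() - data.min()          # range: max minus min
variance = data.var()                          # population variance: mean of squared deviations
std_dev = data.std()                           # standard deviation: square root of the variance
mad = np.mean(np.abs(data - data.mean()))      # mean absolute deviation
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1                                  # interquartile range: spread of the middle 50%
cv = std_dev / data.mean() * 100               # coefficient of variation, as a percentage
```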
In statistics, the null and alternate hypotheses are two mutually exclusive
statements regarding a population parameter. A hypothesis test analyzes
sample data to determine whether to reject or fail to reject the null hypothesis.
Both null and alternate hypotheses represent the opposing statements or
claims about a population or a phenomenon under investigation.
If the p-value is less than the significance level, we reject the null
hypothesis and conclude that there is a statistically significant difference
between the groups.
If p-value ≤ α: Reject the null hypothesis. This indicates that the results
are statistically significant, and there is evidence to support the
alternative hypothesis.
If p-value > α: Fail to reject the null hypothesis. This means that the
results are not statistically significant, and there is insufficient
evidence to support the alternative hypothesis.
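This decision rule can be sketched with SciPy's independent two-sample t-test (assuming SciPy is available; the two samples below are synthetic, generated for illustration only):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=50, scale=5, size=100)   # synthetic control group
group_b = rng.normal(loc=53, scale=5, size=100)   # synthetic treatment group

t_stat, p_value = stats.ttest_ind(group_a, group_b)

alpha = 0.05  # significance level
if p_value <= alpha:
    decision = "reject the null hypothesis"
else:
    decision = "fail to reject the null hypothesis"
```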
Type I error (False Positive, α): A Type I error occurs when the null
hypothesis is rejected when it is actually true. This is also referred to as a
false positive. The probability of committing a Type I error is denoted by α
(alpha) and is also known as the significance level. A lower
significance level (e.g., α = 0.05) reduces the chance of Type I errors
while increasing the risk of Type II errors.
For example, a Type I error would occur if we estimated that a new
medicine was successful when it was not.
Type I Error (False Positive, α): Rejecting a true null
hypothesis.
Type II Error (False Negative, β): A Type II error occurs when a researcher
fails to reject the null hypothesis when it is actually false. This is also
referred to as a false negative. The probability of committing a Type II
error is denoted by β (beta).
For example, a Type II error would occur if we estimated that a new
medicine was not effective when it is actually effective.
Type II Error (False Negative, β): Failing to reject a false null
hypothesis.
For example, A 95% confidence interval indicates that you are 95%
confident that the real population parameter falls inside the interval. A
95% confidence interval for the population mean (μ) can be expressed as :
(x̄ − Margin of Error, x̄ + Margin of Error)
where x̄ is the point estimate (sample mean), and the margin of error is
calculated using the standard deviation of the sample and the confidence
level.
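A minimal sketch of this calculation, using the normal-approximation critical value 1.96 for a 95% confidence level on a hypothetical sample:

```python
import math

sample = [52, 48, 51, 50, 49, 53, 47, 50, 52, 48]  # hypothetical measurements
n = len(sample)
mean = sum(sample) / n

# Sample standard deviation (with Bessel's correction, n - 1)
s = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))

z = 1.96  # critical value for a 95% confidence level (normal approximation)
margin_of_error = z * s / math.sqrt(n)
ci = (mean - margin_of_error, mean + margin_of_error)
```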
ANOVA works by partitioning the total variance in the data into two
components:
Positive correlation (r > 0): As one variable increases, the other tends
to increase. The greater the positive correlation, the closer “r” is to +1.
Negative correlation (r < 0): As one variable rises, the other tends to
fall. The closer “r” is to -1, the greater the negative correlation.
No correlation (r = 0): There is little or no linear relationship between
the variables.
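The positive and negative cases can be illustrated with NumPy's correlation coefficient on made-up data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pos = 2 * x + 1     # moves with x, so r is close to +1
y_neg = -3 * x + 10   # moves against x, so r is close to -1

r_pos = np.corrcoef(x, y_pos)[0, 1]
r_neg = np.corrcoef(x, y_neg)[0, 1]
```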
34. What are the differences between Z-test, T-test and F-test?
The Z-test, t-test, and F-test are statistical hypothesis tests that are
employed in a variety of contexts and for a variety of objectives.
The key differences between the Z-test, T-test, and F-test are as follows:
Z-test:
Assumptions: (1) the population follows a normal distribution; (2) the
population standard deviation is known.
Data: N > 30.
T-test:
Assumptions: (1) the population follows a normal distribution, or the
sample size is large enough for the Central Limit Theorem to apply;
(2) it is also applied when the population standard deviation is unknown.
Data: N < 30, or the population standard deviation is unknown.
F-test:
Assumptions: (1) the variances of the populations from which the
samples are drawn should be equal (homoscedasticity); (2) the
populations being compared have normal distributions and the samples
are independent.
Data: used to test variances.
Y = β0 + β1X + ϵ
Where:
Y is the dependent variable, X is the independent variable, β0 is the
intercept, β1 is the slope coefficient, and ϵ is the error term.
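As a sketch of how such a model is fitted, the coefficients can be estimated by least squares with NumPy (the data points here are hypothetical and noiseless, so the true coefficients are recovered exactly):

```python
import numpy as np

X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
Y = 1.0 + 2.0 * X   # true intercept β0 = 1, true slope β1 = 2, no noise

# polyfit with degree 1 performs least-squares linear regression;
# it returns coefficients from highest degree down: [slope, intercept]
b1, b0 = np.polyfit(X, Y, 1)
```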
CREATE
It is used to create tables in the database. The command used to
create a table is as follows:
READ
UPDATE
UPDATE employees
SET salary = 55000
WHERE last_name = 'Gunjan';
DELETE
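The four CRUD operations can be exercised end to end with Python's built-in sqlite3 module; the employees table and its rows are hypothetical, mirroring the UPDATE example above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# CREATE: define the table
cur.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, last_name TEXT, salary REAL)")

# INSERT: add rows
cur.executemany("INSERT INTO employees (last_name, salary) VALUES (?, ?)",
                [("Gunjan", 50000), ("Sharma", 60000)])

# UPDATE: change an existing row
cur.execute("UPDATE employees SET salary = 55000 WHERE last_name = 'Gunjan'")

# READ: query the updated value
salary = cur.execute("SELECT salary FROM employees WHERE last_name = 'Gunjan'").fetchone()[0]

# DELETE: remove a row
cur.execute("DELETE FROM employees WHERE last_name = 'Sharma'")
remaining = cur.execute("SELECT COUNT(*) FROM employees").fetchone()[0]
```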
38. What is the SQL statement used to insert new records into a table?
We use the ‘INSERT‘ statement to insert new records into a table. The
‘INSERT INTO’ statement in SQL is used to add new records (rows) to a
table.
Syntax
Example
39. How do you filter records using the WHERE clause in SQL?
Syntax
40. How can you sort records in ascending or descending order using
SQL?
The purpose of the GROUP BY clause in SQL is to group rows that have the
same values in specified columns. It arranges rows that share a value in a
particular column into groups, typically so that aggregate functions can be
applied to each group.
Syntax
Example: This SQL query groups the ‘CUSTOMER’ table based on age by
using GROUP BY
42. How do you perform aggregate functions like SUM, COUNT, AVG, and
MAX/MIN in SQL?
SELECT SUM(Cost)
FROM Products;
COUNT: It counts the number of rows in a result set or the number of non-
null values in a column.
SELECT COUNT(*)
FROM Orders;
SELECT AVG(Price)
FROM Products;
SELECT MAX(Price)
FROM Orders;
MIN: It returns the minimum value in a column.
SELECT MIN(Price)
FROM Products;
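All five aggregates can be verified in one pass with sqlite3 and a hypothetical Products table:

```python
import sqlite3

cur = sqlite3.connect(":memory:").cursor()
cur.execute("CREATE TABLE Products (name TEXT, Price REAL)")
cur.executemany("INSERT INTO Products VALUES (?, ?)",
                [("pen", 10), ("book", 50), ("bag", 90)])

total = cur.execute("SELECT SUM(Price) FROM Products").fetchone()[0]
count = cur.execute("SELECT COUNT(*) FROM Products").fetchone()[0]
avg = cur.execute("SELECT AVG(Price) FROM Products").fetchone()[0]
highest = cur.execute("SELECT MAX(Price) FROM Products").fetchone()[0]
lowest = cur.execute("SELECT MIN(Price) FROM Products").fetchone()[0]
```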
43. What is an SQL join operation? Explain different types of joins
(INNER, LEFT, RIGHT, FULL).
SQL Join operation is used to combine data or rows from two or more
tables based on a common field between them. The primary purpose of a
join is to retrieve data from multiple tables by linking records that have a
related value in a specified column. There are different types of joins, i.e.,
INNER, LEFT, RIGHT, and FULL. These are as follows:
INNER JOIN: The INNER JOIN keyword selects all rows from both tables
as long as the condition is satisfied. It creates the result set by combining
all rows from both tables where the condition is satisfied, i.e., the value of
the common field is the same.
Example:
LEFT JOIN: A LEFT JOIN returns all rows from the left table and the
matching rows from the right table.
Example:
SELECT departments.department_name, employees.first_name
FROM departments
LEFT JOIN employees
ON departments.department_id = employees.department_id;
RIGHT JOIN: RIGHT JOIN is similar to LEFT JOIN. This join returns all the
rows of the table on the right side of the join and matching rows for the
table on the left side of the join.
Example:
FULL JOIN: FULL JOIN creates the result set by combining the results of
both LEFT JOIN and RIGHT JOIN. The result set will contain all the rows
from both tables.
Example:
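The difference between INNER and LEFT joins can be seen with sqlite3 (the departments/employees rows are hypothetical; RIGHT and FULL joins are omitted here because older SQLite versions do not support them):

```python
import sqlite3

cur = sqlite3.connect(":memory:").cursor()
cur.execute("CREATE TABLE departments (department_id INTEGER, department_name TEXT)")
cur.execute("CREATE TABLE employees (first_name TEXT, department_id INTEGER)")
cur.executemany("INSERT INTO departments VALUES (?, ?)", [(1, "Sales"), (2, "HR")])
cur.executemany("INSERT INTO employees VALUES (?, ?)", [("Asha", 1), ("Ravi", 1)])

# INNER JOIN keeps only matching rows (HR has no employees, so it is dropped)
inner = cur.execute("""SELECT d.department_name, e.first_name
                       FROM departments d INNER JOIN employees e
                       ON d.department_id = e.department_id""").fetchall()

# LEFT JOIN also keeps HR, filling the missing employee with NULL (None in Python)
left = cur.execute("""SELECT d.department_name, e.first_name
                      FROM departments d LEFT JOIN employees e
                      ON d.department_id = e.department_id""").fetchall()
```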
44. How can you write an SQL query to retrieve data from multiple
related tables?
45. What is a subquery in SQL? How can you use it to retrieve specific
data?
SELECT customer_name,
(SELECT COUNT(*) FROM orders WHERE orders.customer_id =
customers.customer_id) AS order_count
FROM customers;
47. What is the purpose of the HAVING clause in SQL? How is it different
from the WHERE clause?
HAVING clause: It is used to filter groups of rows after grouping. It
operates on the results of aggregate functions applied to grouped
columns. The HAVING clause is typically used with GROUP BY queries and
filters groups of rows based on conditions involving aggregated values.
WHERE clause: It is used to filter rows before grouping. It operates on
individual rows in the table and is applied before grouping and
aggregation. The WHERE clause can be used with any SQL query, whether
it involves grouping or not, and filters individual rows based on specified
conditions.
48. How do you use the UNION and UNION ALL operators in SQL?
In SQL, the UNION and UNION ALL operators are used to combine the
result sets of multiple SELECT statements into a single result set. These
operators allow you to retrieve data from multiple tables or queries and
present it as a unified result. However, there are differences between the
two operators:
1. UNION Operator:
The UNION operator returns only distinct rows from the combined result
sets. It removes duplicate rows and returns a unique set of rows. It is
used when you want to combine result sets and eliminate duplicate rows.
Syntax:
Example:
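The deduplication difference is easy to demonstrate with sqlite3 on two hypothetical single-column tables:

```python
import sqlite3

cur = sqlite3.connect(":memory:").cursor()
cur.execute("CREATE TABLE a (city TEXT)")
cur.execute("CREATE TABLE b (city TEXT)")
cur.executemany("INSERT INTO a VALUES (?)", [("Delhi",), ("Mumbai",)])
cur.executemany("INSERT INTO b VALUES (?)", [("Mumbai",), ("Pune",)])

# UNION removes the duplicate 'Mumbai'; UNION ALL keeps every row
union_rows = cur.execute("SELECT city FROM a UNION SELECT city FROM b").fetchall()
union_all_rows = cur.execute("SELECT city FROM a UNION ALL SELECT city FROM b").fetchall()
```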
50. Can you list and briefly describe the normal forms (1NF, 2NF, 3NF) in
SQL?
Normalization can take numerous forms, the most frequent of which are
1NF (First Normal Form), 2NF (Second Normal Form), and 3NF (Third
Normal Form). Here’s a quick rundown of each:
First Normal Form (1NF): In 1NF, each table cell should contain only a
single value, and each column should have a unique name. 1NF helps
in eliminating duplicate data and simplifies the queries. It is the
fundamental requirement for a well-structured relational database. 1NF
eliminates all the repeating groups of the data and also ensures that
the data is organized at its most basic granularity.
Second Normal Form (2NF): 2NF eliminates partial dependencies,
ensuring that each non-key attribute in the table depends on the entire
primary key. This further reduces data redundancy and anomalies. In 2NF,
each column should be directly related to the primary key, and not to
other columns.
Third Normal Form (3NF): Third Normal Form (3NF) builds on the
Second Normal Form (2NF) by requiring that all non-key attributes are
independent of each other. This means that each column should be
directly related to the primary key, and not to any other columns in the
same table.
51. Explain window functions in SQL. How do they differ from regular
aggregate functions?
SELECT col_name1,
window_function(col_name2)
OVER([PARTITION BY col_name1] [ORDER BY col_name3]) AS new_col
FROM table_name;
Example:
SELECT
department,
AVG(salary) OVER(PARTITION BY department ORDER BY
employee_id) AS avg_salary
FROM
employees;
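The key point, that a window function keeps every input row while GROUP BY collapses them, can be sketched with sqlite3 (window functions require SQLite 3.25 or newer; the employees rows are hypothetical, and ORDER BY is left out of the OVER clause so each row gets its whole partition's average rather than a running average):

```python
import sqlite3

cur = sqlite3.connect(":memory:").cursor()
cur.execute("CREATE TABLE employees (employee_id INTEGER, department TEXT, salary REAL)")
cur.executemany("INSERT INTO employees VALUES (?, ?, ?)",
                [(1, "Sales", 40000), (2, "Sales", 60000), (3, "HR", 30000)])

# Every row is preserved; avg_salary repeats the department average on each row
rows = cur.execute("""SELECT department, salary,
                             AVG(salary) OVER (PARTITION BY department) AS avg_salary
                      FROM employees""").fetchall()
```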
52. What are primary keys and foreign keys in SQL? Why are they
important?
Primary keys and foreign keys are two fundamental concepts in SQL that
are used to build and enforce connections between tables in a relational
database management system (RDBMS).
Primary key: Primary keys are used to ensure that the data in a specific
column is always unique, and a primary key column cannot contain NULL
values. The primary key is either an existing table column or it is
specifically generated by the database itself according to a sequence.
Importance of Primary Keys:
Uniqueness
Query Optimization
Data Integrity
Relationships
Data Retrieval
Foreign key: A foreign key is a column or a group of columns in a
database table that provides a link between data in two tables. The
foreign key column references a column of another table.
Importance of Foreign Keys:
Relationships
Data Consistency
Query Efficiency
Referential Integrity
Cascade Actions
A database transaction is a set of operations that performs a single
logical unit of work, meaning that data in the database has been changed.
Transactions are one of the major mechanisms a DBMS provides to
protect the user’s data from system failure: they ensure that all data is
restored to a consistent state when the system is restarted. A transaction
is any one execution of a user program, and one of its most important
properties is that it contains a finite number of steps.
They are important to maintain data integrity because they ensure that the
database always remains in a valid and consistent state, even in the
presence of multiple users or several operations. Database transactions
are essential for maintaining data integrity because they enforce ACID
properties i.e, atomicity, consistency, isolation, and durability properties.
Transactions provide a solid and robust mechanism to ensure that the
data remains accurate, consistent, and reliable in complex and concurrent
database environments. It would be challenging to guarantee data
integrity in relational database systems without database transactions.
54. Explain how NULL values are handled in SQL queries, and how you
can use functions like IS NULL and IS NOT NULL.
In SQL, NULL is a special value that represents the absence of a value in
a database column. Handling NULL correctly is crucial for accurate and
meaningful data retrieval and manipulation. SQL provides the IS NULL and
IS NOT NULL operators to work with NULL values.
IS NULL: The IS NULL operator is used to check whether an expression or
column contains a NULL value.
IS NOT NULL: The IS NOT NULL operator is used to check whether an
expression or column contains a non-NULL value.
Example: In the below example, the query retrieves all rows from the
employee table where the first name does not contain NULL values.
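A quick sqlite3 sketch shows both operators on a hypothetical employee table containing a NULL first name (note that NULL never matches = or !=, which is why these dedicated operators exist):

```python
import sqlite3

cur = sqlite3.connect(":memory:").cursor()
cur.execute("CREATE TABLE employee (first_name TEXT)")
cur.executemany("INSERT INTO employee VALUES (?)", [("Asha",), (None,), ("Ravi",)])

nulls = cur.execute(
    "SELECT COUNT(*) FROM employee WHERE first_name IS NULL").fetchone()[0]
not_nulls = cur.execute(
    "SELECT COUNT(*) FROM employee WHERE first_name IS NOT NULL").fetchone()[0]
```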
Normalization: Data inconsistency and redundancy are reduced; data
redundancy is eliminated or reduced.
Denormalization: Redundancy is deliberately added for better execution
of queries; redundancy is added instead of being eliminated or reduced.
57. What are the dashboard, worksheet, Story, and Workbook in Tableau?
In Tableau, joining and blending are ways for combining data from various
tables or data sources. However, they are employed in various contexts
and have several major differences:
Tableau supports several join types to combine data from multiple tables
or data sources. Tableau’s major join types are:
Inner Join: An inner join returns only the rows that have matching
values in both tables. Rows that do not have a match in the other table
are excluded from the result.
Left Join: A left join returns all the rows from the left table and
matching rows present in the right table. If there is no match in the
right table, null values are included in the result.
Right Join: A right join returns all the rows from the right table and
matching rows present in the left table. If there is no match in the left
table, null values are included.
Full Outer Join: A full outer join returns all the rows where there is a
match in either the left or right table. It includes all the rows from both
tables and fills in null values where there is no match.
64. What are the different data aggregation functions used in Tableau?
String
Numerical values
Date and time values
Boolean values
Geographic values
Date values
Cluster Values
68. What Are the Filters? Name the Different types of Filters available in
Tableau.
Filters are crucial tools for data analysis and visualization in Tableau.
Filters let you set the requirements that data must meet in order to be
included or excluded, giving you control over which data will be shown in
your visualizations.
There are different types of filters in Tableau:
Extract Filter: These are used to filter the extracted data from the main
data source.
Data Source Filter: These filters are used to filter data at the data
source level, affecting all worksheets and dashboards that use the
same data source.
Dimension Filter: These filters are applied to qualitative, non-aggregated
fields (dimensions).
Context Filter: These filters are used to define a context to your data,
creating a temporary subset of data based on the filter conditions.
Measure Filter: These filters can be used in performing different
aggregation functions. They are applied to quantitative fields.
Table Calculation Filter: These filters are used to view data without
filtering any hidden data. They are applied after the view has been
created.
Sets: Sets are used to build custom data subsets based on predefined
conditions or standards. They give you the ability to dynamically
segment your data, which facilitates the analysis and visualization of
particular subsets. Sets can be categorical or numeric and can be built
from dimensions or measures. They are flexible tools that let you
compare subsets, highlight certain data points, or perform real-time
calculations. For instance, you can construct a set of “Hot Leads”
based on potential customers with high engagement scores, or
create a set of high-value customers by choosing customers with total
purchases above a pre-determined level. Sets are dynamic and
adaptable for a variety of analytical tasks because they can change as
the data does.
Groups: Groups are used to combine members (dimension values) into
higher-level categories. They do this by grouping comparable values
into useful categories, which simplifies complex data. Group members
are fixed and do not change as the data does, since groups are static.
Groups, which are typically constructed from dimensions, are crucial
for classifying and labeling data points. For instance, you can combine
small subcategories of products into larger categories or make your
own dimension by combining different dimensions. Data can be
presented and organized in a structured form using groups, which makes
it easier to analyze and visualize.
70. Explain the different types of charts available in Tableau with their
significance.
Bar Chart: Bar charts are useful for comparing categorical data and can
be used to show the distribution of data across categories or to compare
values between categories.
Line Chart: Line charts are excellent for showing trends and changes
over time. They are commonly used for time series data to visualize
how a single measure changes over time.
Area Chart: They are similar to line charts, but the area under the line is
filled with color. They are used with multiple variables in the data to
demonstrate the differences between the variables.
Pie Chart: It shows parts of a whole. They are useful for illustrating the
distribution of data where each category corresponds to a share of the
total.
Tree Maps: They show hierarchical data as nested rectangles. They are
helpful for illustrating hierarchical structures, such as organizational or
file directories.
Bubble chart: Bubble charts are valuable for visualizing and comparing
data points with three different attributes. They are useful when you
want to show relationships, highlight data clusters, etc.
Scatter Plot: They are used to display the relationship between two
continuous variables. They help find correlations, clusters or outliers in
the data.
Density Map: Density maps are used to represent the distribution and
concentration of data points or values within a 2D space.
Heat Map: Heat maps are used to display data on a grid, where color
represents values. They are useful for visualizing large datasets and
identifying patterns.
Symbol Map: Symbol maps are used to represent geographic data by
placing symbols or markers on a map to convey information about
specific locations.
Gantt Chart: Gantt charts are used for project management to
visualize tasks, their durations, and dependencies over time.
Bullet Graph: They are used for tracking progress towards a goal. They
provide a compact way to display a measure, target and performance
ranges.
Box Plot (Box and Whisker): They are used to display the distribution
of data and identify outliers. They show the median, quartiles, and
potential outliers.
Connect with the data source. Create a chart by dragging and dropping
the dimension and measure into “column” and “rows” shelf,
respectively.
Duplicate the chart by right-clicking on the chart and selecting
“Duplicate”. This will create a duplicate of the chart.
In the duplicated chart, change the measure you want to display by
dragging the new measure to the “columns” or “rows” shelf, replacing
the existing measure.
In the second chart, assign the measure to different axis by clicking on
the “dual-axis”. This will create two separate axes on the chart.
Right click on one of the axes and select “synchronize axis”. Adjust
formatting, colors and labels as needed. You now have a dual-axis
chart.
A Gantt chart has horizontal bars laid out on two axes: tasks are
represented on the Y-axis, and time estimates are represented on the X-
axis. It is an excellent approach to show which tasks may be completed
concurrently, which need to be prioritized, and how they depend on one
another.
A Gantt chart is a visual representation of project schedules, timelines, or
task durations. This common form of chart is used in project management
to illustrate tasks, their start and end dates, and their dependencies.
Gantt charts are a useful tool in Tableau for tracking and analyzing project
progress and deadlines, since you can build them using a variety of
dimensions and measures.
If two measures have the same scale and share the same axis, they can
be combined using the blended axis function. The trends could be
misinterpreted if the scales of the two measures are dissimilar.
78. How to handle Null, incorrect data types and special values in
Tableau?
Handling null values, incorrect data types, and special values is an
important element of data preparation in Tableau. The following are some
popular strategies and recommended practices for coping with these data
issues:
83. What is data source filtering, and how does it impact performance?
Open the Tableau workbook and select the visualization you want to
export.
Go to the “File” menu, select “Export”.
After selecting “Export”, a submenu will appear with various export
options. Choose the format you want to export to (PDF, image, etc.).
Depending on the chosen export format, you may have some
configuration options that you can change according to the needs.
Specify the directory or the folder where you want to save the exported
file and name it.
Once the settings are configured, click on “Save” or “Export”.