Frequently Asked Data Analyst Questions by MAANG Companies
*Disclaimer*
Everyone learns uniquely.
Question - 1
of data fast
decision making.
Question - 2
Explain how you would approach cleaning a
dataset with 10% missing values.
To clean a dataset with 10% missing values, I would:
Assess the missing data : identify which columns contain missing values and how many records are affected
Decide on handling methods :
For numerical columns with many missing values, use imputation (mean, median, or model-based) or drop the affected rows/columns
For categorical variables, use mode imputation or create a new category such as “Unknown”
Validate the result : ensure that cleaning does not erase meaningful patterns or introduce bias (a small imputation sketch follows).
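A minimal SQL sketch of both imputation ideas, assuming a hypothetical orders table with a numeric amount column and a categorical region column (PERCENTILE_CONT syntax as in PostgreSQL):
sql
-- Median imputation for a numeric column, and an 'Unknown' bucket for a
-- categorical one; all table and column names here are illustrative.
WITH stats AS (
    SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY amount) AS median_amount
    FROM orders
    WHERE amount IS NOT NULL
)
SELECT
    COALESCE(o.amount, s.median_amount) AS amount_clean,
    COALESCE(o.region, 'Unknown')       AS region_clean
FROM orders o
CROSS JOIN stats s;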
Question - 3
How do you design an ETL pipeline for real-time
analytics?
To design an ETL (Extract, Transform, Load) pipeline for real-
time analytics:
Extract : pull data from real-time sources such as message queues (e.g., Kafka) or streaming APIs
Transform : apply operations such as filtering, aggregation, or enrichment on the fly, typically with a stream-processing engine (Apache Flink, Spark Streaming, etc.)
Load : write the transformed data to real-time analytics storage, such as a real-time data warehouse like AWS Redshift or Google BigQuery.
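Stream-processing engines usually expose a SQL layer for the Transform step. A hedged Flink SQL sketch, assuming a hypothetical page_views source table (declared elsewhere with a Kafka connector and a watermark on view_time):
sql
-- Count page views per URL in one-minute tumbling windows; all names
-- here are illustrative, not from the original text.
SELECT
    url,
    TUMBLE_START(view_time, INTERVAL '1' MINUTE) AS window_start,
    COUNT(*) AS views
FROM page_views
GROUP BY url, TUMBLE(view_time, INTERVAL '1' MINUTE);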
Question - 4
How do you ensure data quality in a project?
To ensure data quality in a project, I focus on:
Clear Data Collection Standards : define and document guidelines for how data is collected, and make sure the process follows them
Data Validation : validate incoming data, usually with automated checks, and run those checks frequently
Data Cleaning : remove duplicates and irrelevant records
Timely Updates : refresh the data regularly so it stays current
Regular Audits : review the data periodically to confirm it is accurate and complete (a small automated check is sketched below).
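As one illustration of an automated validation check, a SQL sketch assuming a hypothetical customers table (FILTER syntax as in PostgreSQL):
sql
-- Count missing emails and duplicate IDs that a scheduled quality check
-- might report; table and column names are illustrative.
SELECT
    COUNT(*) FILTER (WHERE email IS NULL)  AS missing_emails,
    COUNT(*) - COUNT(DISTINCT customer_id) AS duplicate_ids
FROM customers;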
Question - 5
hypothesis testing?
hypothesis.
Question - 6
Question - 7
Explain how you would optimize a SQL query for
large datasets.
To optimize a SQL query for large datasets:
Use indexes : index the columns used in WHERE filters, JOIN conditions, and ORDER BY clauses
Limit the result set : return only the rows you need with LIMIT (or TOP in SQL Server); a short example follows.
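A minimal sketch of both ideas, assuming a hypothetical orders table:
sql
-- Index the column used for filtering, then cap the result set.
CREATE INDEX idx_orders_customer_id ON orders (customer_id);

SELECT order_id, amount
FROM orders
WHERE customer_id = 42
ORDER BY order_date DESC
LIMIT 100;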
Question - 8
How do you handle skewed data distributions?
To handle skewed data distributions, you can use techniques
like:
Log Transformation : for skewed continuous variables, apply a log or square-root transformation
Winsorization : cap variables at chosen maximum and minimum values to limit the effect of outliers
Resampling : undersample or oversample when the data are imbalanced
Model Selection : prefer algorithms that are less affected by skewed data, such as tree-based models.
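For instance, a log transformation can be applied directly in SQL; a sketch assuming a hypothetical sales table with a right-skewed revenue column:
sql
-- LN requires strictly positive input, hence the +1 shift for zero values.
SELECT revenue, LN(revenue + 1) AS log_revenue
FROM sales;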
Question - 9
What is a Type I error and Type II error? Give
examples
In hypothesis testing:
A Type I error (false positive) occurs when the null hypothesis is rejected even though it is actually true.
Example: An X-ray wrongly suggests that a person free of a certain disease has the disease
A Type II error (false negative) occurs when you fail to reject a null hypothesis that is actually false.
Example: A medical test fails to detect a disease in a person who, in reality, has it.
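In the standard notation (a well-known convention, not from the original text), the two error rates are written as:

α = P(reject H₀ | H₀ is true) (Type I error rate)
β = P(fail to reject H₀ | H₀ is false) (Type II error rate)

with 1 − β called the power of the test.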
Question - 10
What is the difference between LEFT JOIN and
FULL OUTER JOIN in SQL?
In SQL:
LEFT JOIN : returns all records from the left table along with the matching records from the right table; where there is no match, NULL values are returned for the columns of the right table
FULL OUTER JOIN : returns all records from both tables, matched where possible; unmatched rows from either the left or the right table appear with NULL values in the other table’s columns.
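A minimal sketch, assuming hypothetical customers and orders tables (note that some engines, such as MySQL, do not support FULL OUTER JOIN directly):
sql
-- LEFT JOIN: every customer, with NULL order columns for customers
-- that have no orders.
SELECT c.customer_id, o.order_id
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.customer_id;

-- FULL OUTER JOIN: every customer and every order, matched where possible.
SELECT c.customer_id, o.order_id
FROM customers c
FULL OUTER JOIN orders o ON o.customer_id = c.customer_id;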
Question - 11
product performance?
key features:
a glance
product/s)
Question - 12
How do you decide between RDBMS and NoSQL
for a project?
RDBMS (e.g., MySQL, PostgreSQL) : a good fit for structured data, complex relationships between tables, and workloads that need reliable (ACID) transactions
NoSQL (e.g., MongoDB, Cassandra) : a good fit for unstructured or semi-structured data, for workloads that must scale out as the business grows, and for schemas that may evolve over time.
Question - 13
Explain the concept of data normalization in
databases.
Data normalization in databases is the process of organizing data to reduce redundancy and improve consistency. It involves breaking data into smaller, more focused tables and establishing relationships between them. This makes the database more flexible and easier to maintain than one large, complicated structure.
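A minimal sketch of the idea with hypothetical tables: instead of repeating customer details on every order row, customers get their own table and orders reference it by key:
sql
CREATE TABLE customers (
    customer_id INT PRIMARY KEY,
    name        TEXT,
    email       TEXT
);

CREATE TABLE orders (
    order_id    INT PRIMARY KEY,
    customer_id INT REFERENCES customers (customer_id),  -- one copy per customer
    amount      DECIMAL(10, 2)
);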
Question - 14
How do you detect and handle outliers in a
dataset?
Detect Outliers :
Use graphical methods such as box plots or scatter plots
Use statistical methods such as the IQR rule or Z-scores
Handle Outliers :
Remove : drop outliers that come from data entry errors or are irrelevant to the analysis
Transform : reduce their impact with methods such as a log transformation
Cap/Impute : replace outliers with a capped maximum or the median value (an IQR sketch follows).
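A hedged IQR sketch, assuming a hypothetical sales table (PERCENTILE_CONT syntax as in PostgreSQL):
sql
-- Flag rows outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
WITH bounds AS (
    SELECT
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY amount) AS q1,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY amount) AS q3
    FROM sales
)
SELECT s.*
FROM sales s
CROSS JOIN bounds b
WHERE s.amount < b.q1 - 1.5 * (b.q3 - b.q1)
   OR s.amount > b.q3 + 1.5 * (b.q3 - b.q1);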
Question - 15
How do you approach A/B testing?
I approach A/B testing by starting with a specific objective, for example increasing conversion. I then split the audience into two groups: one sees the current version (control) and the other sees the changed version (variation). I make sure the test runs long enough to collect adequate data, then use statistical testing to determine which version performed better.
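A minimal sketch of the comparison step, assuming a hypothetical experiments table with one row per user, a variant column, and a 0/1 converted flag:
sql
-- Conversion rate per group; statistical significance would still need
-- a proper test on top of these numbers.
SELECT
    variant,
    COUNT(*)       AS users,
    AVG(converted) AS conversion_rate
FROM experiments
GROUP BY variant;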
Question - 16
What is the difference between batch
processing and stream processing?
Batch Processing : processes a large volume of data collected over a period, on a schedule rather than interactively (for instance, preparing daily, weekly, or monthly reports)
Stream Processing : analyzes data in real time, as it is produced, so you can act on it immediately (for instance, monitoring live traffic on a website).
Question - 17
How do you optimize joins in SQL queries?
To optimize joins in SQL queries:
Use Proper Indexing : make sure the columns used in the join conditions are indexed
Filter Early : apply filters in the WHERE or ON clause to reduce the dataset before joining
Choose the Right Join Type : prefer INNER JOIN where the logic allows, as it is typically faster than an OUTER JOIN; a short example follows.
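A minimal sketch, assuming hypothetical customers and orders tables with an index on orders.customer_id:
sql
SELECT c.customer_id, o.order_id, o.amount
FROM customers c
INNER JOIN orders o
    ON o.customer_id = c.customer_id      -- join on an indexed key
WHERE o.order_date >= DATE '2024-01-01';  -- filter early to shrink the join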
Question - 18
How do you design a data warehouse with a
star schema?
To design a data warehouse with a star schema :
Identify the Business Process : decide which process you want to model, for instance sales or inventory
Define the Fact Table : create a central table holding the numeric measures, such as sales amount or units sold
Define Dimension Tables : create separate tables for the descriptive attributes (time, product, customer, location, etc.)
Establish Relationships : link the fact table to each dimension table through primary key/foreign key relationships
Optimize for Queries : keep the schema simple and flat so the queries you need stay fast (a sketch follows).
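A minimal star-schema sketch with hypothetical table and column names:
sql
CREATE TABLE dim_date (
    date_key  INT PRIMARY KEY,
    full_date DATE,
    month     INT,
    year      INT
);

CREATE TABLE dim_product (
    product_key  INT PRIMARY KEY,
    product_name TEXT,
    category     TEXT
);

-- The fact table holds the numeric measures and points at each dimension.
CREATE TABLE fact_sales (
    sale_id      BIGINT PRIMARY KEY,
    date_key     INT REFERENCES dim_date (date_key),
    product_key  INT REFERENCES dim_product (product_key),
    units_sold   INT,
    sales_amount DECIMAL(12, 2)
);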
Question - 19
How would you calculate the 90th percentile of
sales in SQL?
The 90th percentile of sales can be computed with the built-in PERCENTILE_CONT function, which returns a percentile within a given set of values arranged in a specified order.
sql
SELECT PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY
sales) AS percentile_90
FROM sales_table;
This will return the 90th percentile of the sales column from
the sales_table.
Question - 20
What is the difference between a Snowflake
schema and a Star schema?
A Star schema is a simple data model with a central fact table connected directly to one or more dimension tables, forming a star-like structure
A Snowflake schema is a bit more complex: it resembles the Star schema, but the dimension tables are normalized into several related tables, creating a ‘snowflake’ structure.
Question - 21
What is the role of indexing in databases?
Database indexing speeds up data retrieval by maintaining a separate lookup structure over one or more columns (such as a hash table or a B-tree), so the database does not have to scan the entire table to find a particular row. Like an index in a book, it helps you find things faster.
Question - 22
How would you calculate churn rate in SQL?
To calculate churn rate in SQL:
Count the total customers at the beginning of the period (total_customers_start)
Count the churned customers, i.e., the number of customers who left during the period (churned_customers).
Use the formula :
sql
SELECT
    (CAST(churned.churned_customers AS FLOAT)
     / start.total_customers_start) * 100 AS churn_rate
FROM (
    SELECT COUNT(*) AS churned_customers
    FROM customers
    WHERE churned_at IS NOT NULL  -- placeholder condition; depends on the schema
) AS churned
CROSS JOIN (
    SELECT COUNT(*) AS total_customers_start
    FROM customers
) AS start;
Question - 23
How do you decide between using Python or
SQL for a data task?
Use Python when the task involves heavy computation, advanced analytics or machine learning, or processing of free-form data. Prefer SQL when the task is to work directly with a relational database: filtering, aggregating, or joining large tables through queries.
Question - 24
What is the difference between supervised and unsupervised learning?
Supervised Learning : trains a model on labeled data, where each input comes with a known target (for example, predicting churn from historical outcomes)
Unsupervised Learning : finds patterns in unlabeled data with no predefined targets (for example, clustering customers into segments).
Question - 25
How do you prioritize tasks in a data analytics
project?
Prioritize tasks in a data analytics project using these steps :
Define Objectives : understand what the project is about and which key questions it must answer
Assess Impact : focus first on the tasks that are most critical or valuable
Sequence Dependencies : complete foundational tasks before the analytical steps that depend on them
Allocate Resources : match tasks to team members’ skills and the resources at their disposal
Set Timelines : break the project into stages with target dates
Iterate : review the results and update the plan as the project develops.
Question - 26
How do you decide which visualization to use
for a given dataset?
To decide on a visualization :
Understand Your Data : consider the type of data (categorical or numerical) and the kind of relationship you need to show, namely comparison, distribution, trends, or composition
Define Your Goal : be clear about what you need to portray, for example temporal changes, relative sizes, or relationships
Choose the Right Chart :
Comparison : Bar chart, line chart
Distribution : Histogram, box plot
Trends : Line chart
Composition : Pie chart, stacked bar chart
Relationships : Scatter plot, bubble chart.
Question - 27
What is the difference between UNION and
UNION ALL in SQL?
The key difference between UNION and UNION ALL in SQL is :
UNION : combines the results of two queries and removes duplicate rows
UNION ALL : combines the results and keeps every row, including duplicates, which makes it faster since no de-duplication step is needed.
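A minimal sketch, assuming two hypothetical tables whose email columns overlap:
sql
SELECT email FROM newsletter_signups
UNION          -- duplicates removed
SELECT email FROM purchasers;

SELECT email FROM newsletter_signups
UNION ALL      -- duplicates kept; no de-duplication cost
SELECT email FROM purchasers;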
Question - 28
How do you ensure the scalability of a data
pipeline?
To ensure scalability in a data pipeline :
Distributed Processing : handle large datasets with distributed platforms such as Apache Spark or Kafka
Horizontal Scaling : add machines (nodes) to absorb increased workload
Modular Design : build the pipeline as independent, composable steps so each can scale on its own
Auto-scaling : use cloud services that scale up automatically with load
Optimized Storage : use scalable, low-cost cloud object storage such as S3 or GCS
Monitoring and Load Balancing : monitor performance continuously and keep the load evenly distributed.
Question - 29
How do you handle correlated variables in
predictive modeling?
To handle correlated variables in predictive modeling :
Identify Correlation : use a correlation matrix to find highly correlated variables, for instance a Pearson correlation coefficient above 0.8
Remove Redundancy : keep one variable and drop the others that carry essentially the same information
Use Regularization : methods such as Lasso or Ridge regression manage correlation by penalizing less important features
Dimensionality Reduction : use methods such as PCA to replace several related variables with orthogonal components
Domain Knowledge : keep the variable that best fits the business problem (a small correlation check is sketched below).
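Correlation can be checked directly in SQL where the engine supports it; a sketch assuming a hypothetical features table (the CORR aggregate is available in e.g. PostgreSQL):
sql
-- Pairwise Pearson correlation between two candidate predictors.
SELECT CORR(income, spend) AS income_spend_corr
FROM features;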
Question - 30
Explain the difference between rank() and
dense_rank() in SQL.
The key difference between RANK() and DENSE_RANK() in
SQL lies in how they handle ranking when there are ties :
RANK() : may leave gaps in the ranking when there are ties. For example, if two rows share rank 1, the next row gets rank 3 (1, 1, 3)
DENSE_RANK() : does not create gaps. If two rows tie for rank 1, the next row gets rank 2 (1, 1, 2)
Both assign each row a rank according to the specified ordering; the sketch below shows the tie behavior.
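A minimal sketch, assuming a hypothetical scores table:
sql
-- On a tie for the top score: RANK gives 1, 1, 3; DENSE_RANK gives 1, 1, 2.
SELECT
    player,
    score,
    RANK()       OVER (ORDER BY score DESC) AS rnk,
    DENSE_RANK() OVER (ORDER BY score DESC) AS dense_rnk
FROM scores;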
Why Bosscoder?
2200+ Alumni placed at Top Product-based companies.