15 Data Analyst Questions

The document discusses various data analysis and machine learning concepts and techniques. It provides explanations and examples of metrics like R2, dimensionality reduction with PCA, dealing with missing data, multicollinearity, ensemble methods, clustering customers, collaborative filtering for recommendations, and identifying causes of changes in metrics.

Uploaded by

arasan77silambu

Interview Preparation Document

DATA DOJO | SUNIL KAPPAL

DATA ANALYTICS
QUESTIONS EVERY
ANALYST SHOULD KNOW
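What is R^2? What are some other metrics that could be
better than R^2, and why?
• R^2 is a goodness-of-fit measure: the variance explained by
the regression divided by the total variance.
• Adding predictors never lowers R^2 on the training data,
even when the new predictors are uninformative.
• Hence use adjusted R^2, which adjusts for the degrees
of freedom.
• Or use error metrics computed on held-out data (for example,
test-set RMSE or MAE) rather than on the training set.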
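
As a rough illustration of the difference between R^2 and adjusted R^2, here is a minimal sketch in plain Python. The function names and the five data points are made up for the example; the adjusted formula is the usual 1 - (1 - R^2)(n - 1)/(n - p - 1).

```python
def r_squared(y, y_hat):
    """Share of the total variance in y explained by the predictions."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))  # unexplained variation
    ss_tot = sum((a - mean_y) ** 2 for a in y)            # total variation
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n_samples, n_predictors):
    """Penalize R^2 for the degrees of freedom spent on predictors."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)


# Hypothetical observations and fitted values.
y = [1.0, 2.0, 3.0, 4.0, 5.0]
y_hat = [1.1, 1.9, 3.2, 3.8, 5.0]
r2 = r_squared(y, y_hat)
print(r2, adjusted_r_squared(r2, len(y), 1), adjusted_r_squared(r2, len(y), 2))
```

For the same fit, the adjusted value drops as the assumed number of predictors grows, which is the penalty the bullet above refers to.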

What is the curse of dimensionality?

• High dimensionality makes clustering hard, because
having lots of dimensions means that all points are "far
away" from each other.
• For example, to cover a fixed fraction of the volume of the
data we need to capture a very wide range of each variable
as the number of variables increases.
• Most samples lie close to the edge of the sample space. This
is bad news because prediction is much more difficult
near the edges of the training sample.
• The sampling density decreases exponentially as the number
of dimensions p increases, so the data become much sparser
unless the sample size grows accordingly.
• Dimensionality reduction, for example PCA, is a common
remedy.
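
The "very wide range for each variable" point can be made concrete with a small calculation. This sketch assumes the data are spread uniformly over a unit hypercube (an assumption made for illustration only): a sub-cube holding a fraction r of the volume needs side length r^(1/p).

```python
def edge_fraction(volume_fraction, p):
    """Side length (as a fraction of each variable's range) of a sub-cube
    that holds `volume_fraction` of a unit hypercube in p dimensions."""
    return volume_fraction ** (1.0 / p)


# How much of each variable's range is needed to capture 10% of the volume?
for p in (1, 2, 10, 100):
    print(p, round(edge_fraction(0.10, p), 3))
```

With one variable, 10% of the range is enough; with many variables, nearly the whole range of every variable is needed, so a "local" neighborhood is no longer local.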

Let’s say you’re given an infeasible number of predictors in a
predictive modeling task. What are some ways to make the
prediction more feasible?
• PCA
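
A minimal sketch of PCA as a way to shrink a wide predictor matrix, assuming NumPy is available. The helper name `pca_reduce` and the synthetic data (20 predictors driven by 2 hidden factors) are invented for the example.

```python
import numpy as np


def pca_reduce(X, n_components):
    """Project X (rows = samples) onto its top principal components."""
    X_centered = X - X.mean(axis=0)
    # SVD of the centered data: the rows of Vt are the principal directions.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained = (S ** 2) / np.sum(S ** 2)      # variance share per component
    scores = X_centered @ Vt[:n_components].T  # coordinates in the reduced space
    return scores, explained[:n_components]


rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))             # two hidden factors (hypothetical)
mixing = rng.normal(size=(2, 20))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 20))  # 20 correlated predictors

Z, share = pca_reduce(X, 2)
print(Z.shape, share.sum())
```

Here two components retain almost all of the variance because the predictors were built from two factors; on real data the retained share has to be checked rather than assumed.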

Is more data always better?
• Statistically:
  • It depends on the quality of your data. For example, if
  your data is biased, just getting more data won’t help.
  • It depends on your model. If your model suffers from
  high bias, getting more data won’t improve your test
  results beyond a point. You’d need to add more
  features, etc.
• Practically:
  • There’s a tradeoff between having more data and
  the additional storage, computational power, and memory
  it requires. Hence, always think about the cost of
  having more data.

What are advantages of plotting your data before performing
analysis?
• Data sets have errors. You won't find them all but you
might find some. That 212 year old man. That 9 foot tall
woman.
• Variables can have skewness, outliers, etc. Then the
arithmetic mean might not be a useful summary, which means
the standard deviation isn't useful either.
• Variables can be multimodal! If a variable is multimodal
then anything based on its mean or median is going to be
suspect.
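
To see how a single bad record distorts summary statistics, here is a small sketch using only the standard library. The ages are invented, with one 212-year-old echoing the example above, and a crude text histogram stands in for a real plot.

```python
import statistics

# Hypothetical ages; 212 is almost certainly a data-entry error.
ages = [23, 31, 35, 38, 41, 44, 52, 212]

print("mean  ", statistics.mean(ages))    # pulled upward by the bad value
print("median", statistics.median(ages))  # barely affected
print("stdev ", statistics.stdev(ages))   # inflated by the bad value

# A crude text histogram is often enough to spot the problem.
for lo in range(0, 250, 50):
    count = sum(lo <= a < lo + 50 for a in ages)
    print(f"{lo:3d}-{lo + 49:3d} | {'#' * count}")
```

The isolated bar in the last bin is what a quick plot would reveal before any model is fitted.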

How can you make sure that you don’t analyze
something that ends up meaningless?
• Proper exploratory data analysis.
• In every data analysis task, there's the exploratory phase
where you're just graphing things, testing things on small
sets of the data, summarizing simple statistics, and getting
rough ideas of what hypotheses you might want to pursue
further.
• Then there's the exploitatory phase, where you look deeply
into a set of hypotheses.
• The exploratory phase will generate lots of possible
hypotheses, and the exploitatory phase lets you examine a
few of them in depth.

What is the role of trial and error in data analysis? What is
the role of making a hypothesis before diving in?
Data analysis is a repetition of setting up a new hypothesis and
trying to refute the null hypothesis.
The scientific method is eminently inductive: we elaborate a
hypothesis, test it, and refute it or not. As a result, we come up
with new hypotheses, which are in turn tested, and so on. This is an
iterative process, as science always is.

How can you determine which features are the most important in
your model?
• Run the features through a Gradient Boosting Machine or Random
Forest to generate plots of relative importance and information
gain for each feature in the ensembles.
• Look at the order in which variables are added in forward
variable selection.
How do you deal with some of your predictors being
missing?
• Remove rows with missing values - This works well if 1)
the values are missing at random and 2) you don't lose too
much of the dataset after doing so.
• Build another predictive model to predict the missing
values - This could be a whole project in itself, so simple
techniques (such as filling in the mean or median) are
usually used here.
• Use a model that can incorporate missing data - For example,
tree-based methods whose implementation handles missing
values natively.
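
The first two options can be sketched with the standard library alone. The records and the helper `impute_median` are hypothetical, and median filling is just one of the simple techniques mentioned above.

```python
import statistics

# Hypothetical records; None marks a missing value.
rows = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 61000},
    {"age": 45, "income": None},
    {"age": 29, "income": 48000},
    {"age": 51, "income": 75000},
]

# Option 1: drop incomplete rows. Reasonable when values are missing at
# random and few rows are lost.
complete = [r for r in rows if all(v is not None for v in r.values())]


# Option 2: simple imputation, filling each gap with the column median.
def impute_median(rows):
    medians = {
        key: statistics.median(r[key] for r in rows if r[key] is not None)
        for key in rows[0]
    }
    return [{k: (medians[k] if v is None else v) for k, v in r.items()} for r in rows]


filled = impute_median(rows)
print(len(complete), filled[1]["age"], filled[2]["income"])
```

Dropping rows here loses two of five records, which is the kind of loss that would argue for imputation instead.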

You have several variables that are positively correlated
with your response, and you think combining all of the
variables could give you a good prediction of your
response. However, you see that in the multiple linear
regression, one of the weights on the predictors is
negative. What could be the issue?
• The likely issue is multicollinearity: a situation in which
two or more explanatory variables in a multiple regression
model are highly linearly related. Individual weights then
become unstable and can take an unexpected sign.
• One option is to leave the model as is, despite
multicollinearity. The presence of multicollinearity doesn't
affect the efficiency of extrapolating the fitted model to new
data provided that the predictor variables follow the same
pattern of multicollinearity in the new data as in the data
on which the regression model is based.
• Another option is principal component regression.
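
The sign flip can be reproduced on synthetic data. This sketch assumes NumPy; the variables `x1` and `x2` and the coefficients 3 and -1 are invented so that both predictors correlate positively with the response while one weight is negative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)                  # nearly a copy of x1
y = 3.0 * x1 - 1.0 * x2 + 0.1 * rng.normal(size=n)  # hypothetical response

# Marginally, both predictors are strongly positively correlated with y.
corr_x1_y = np.corrcoef(x1, y)[0, 1]
corr_x2_y = np.corrcoef(x2, y)[0, 1]
print(corr_x1_y, corr_x2_y)

# Jointly, least squares assigns x2 a negative weight.
A = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coef)

# Variance inflation factor; with two predictors it is 1 / (1 - r^2).
r = np.corrcoef(x1, x2)[0, 1]
vif = 1.0 / (1.0 - r ** 2)
print(vif)
```

A large variance inflation factor is a common warning sign that individual weights should not be interpreted on their own.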

What is the main idea behind ensemble learning? If I had many
different models that predicted the same response variable, what
might I want to do to incorporate all of the models? Would you
expect this to perform better than an individual model or worse?
• The assumption is that a group of weak learners can be combined
to form a strong learner.
• Hence the combined model is expected to perform better than an
individual model.
• Why combining helps:
  • it averages out biases
  • it reduces variance
• Bagging works because some underlying learning algorithms are
unstable: slightly different inputs lead to very different outputs. If
you can take advantage of this instability by running multiple
instances, it can be shown that the reduced instability leads to
lower error. If you want to understand why, the original bagging
paper (http://www.springerlink.com/cont...) has a section called
"why bagging works".
• Boosting works because of the focus on better defining the
"decision edge". By reweighting examples near the margin (the
positive and negative examples) you get a reduced error
(see http://citeseerx.ist.psu.edu/vie...).
• Use the outputs of your models as inputs to a meta-model.
• For example, if you're doing binary classification, you can use all
the probability outputs of your individual models as inputs to a
final logistic regression (or any model, really) that can combine
the probability estimates.
• One very important point is to make sure that the outputs of your
models are out-of-sample predictions. This means that the
predicted value for any row in your dataframe should NOT
depend on the actual value for that row; otherwise the
meta-model learns from leaked labels.
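
The meta-model idea with out-of-sample inputs can be sketched as follows, assuming NumPy. To keep it short, the sketch uses a regression task and least-squares models rather than the binary classification and logistic regression mentioned above; all function names and data are hypothetical.

```python
import numpy as np


def fit_linear(X, y):
    """Least-squares fit with an intercept; returns the coefficient vector."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict_linear(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef


def out_of_fold_predictions(X, y, n_folds=5):
    """Each row's prediction comes from a model that never saw that row."""
    oof = np.empty(len(y))
    for held_out in np.array_split(np.arange(len(y)), n_folds):
        train = np.setdiff1d(np.arange(len(y)), held_out)
        coef = fit_linear(X[train], y[train])
        oof[held_out] = predict_linear(coef, X[held_out])
    return oof


rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y = 2.0 * X[:, 0] + 3.0 * X[:, 1] + 0.1 * rng.normal(size=400)

# Two deliberately weak base models, each seeing only one feature.
base_inputs = [X[:, [0]], X[:, [1]]]
meta_X = np.column_stack([out_of_fold_predictions(Xb, y) for Xb in base_inputs])

# Meta-model: a linear combination of the base models' out-of-fold outputs.
meta_coef = fit_linear(meta_X, y)
stacked = predict_linear(meta_coef, meta_X)


def mse(pred):
    return float(np.mean((y - pred) ** 2))


print(mse(meta_X[:, 0]), mse(meta_X[:, 1]), mse(stacked))
```

The stacked error here is measured on the same rows used to fit the meta-model, so a fair comparison on real data would need a further held-out set.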

You have 5000 people that rank 10 kinds of sushi in terms of
saltiness. How would you aggregate this data to estimate
the true saltiness rank of each sushi?
• Some people would take the mean rank of each sushi. If I
wanted something simple, I would use the median, since
ranks are (strictly speaking) ordinal and not interval, so
adding them is a bit risky (but people do it all the time
and you probably won't be far wrong).
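
A minimal sketch of median-rank aggregation using the standard library. The three sushi names and five rankings are invented; the tie-breaking rule (mean rank, then name) is an assumption of the sketch, not something the answer above prescribes.

```python
import statistics

# Each list is one person's ranking, least salty first (hypothetical data).
rankings = [
    ["tuna", "salmon", "eel"],
    ["salmon", "tuna", "eel"],
    ["tuna", "eel", "salmon"],
    ["tuna", "salmon", "eel"],
    ["salmon", "tuna", "eel"],
]


def aggregate_by_median(rankings):
    """Order items by the median of the rank positions they received."""
    positions = {}
    for ranking in rankings:
        for position, item in enumerate(ranking, start=1):
            positions.setdefault(item, []).append(position)
    medians = {item: statistics.median(p) for item, p in positions.items()}
    # Ties in the median are broken by mean rank, then by name.
    return sorted(
        medians,
        key=lambda i: (medians[i], statistics.mean(positions[i]), i),
    )


consensus = aggregate_by_median(rankings)
print(consensus)
```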

How would you come up with an algorithm to detect
plagiarism in online content?
• Reduce the text to a more compact form (e.g.
fingerprinting, bag of words), then compare it with
other texts by calculating the similarity.
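
One way to sketch the "compact form plus similarity" idea is word shingling with Jaccard similarity. This is a swapped-in concrete choice, not the only fingerprinting scheme; the function names and sample sentences are made up for the example.

```python
import re


def shingles(text, k=3):
    """Set of k-word shingles from lowercased, punctuation-free text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


original = "The quick brown fox jumps over the lazy dog near the river bank."
suspect = "A quick brown fox jumps over the lazy dog by the river."
unrelated = "Gradient boosting builds an ensemble of shallow decision trees."

sim_suspect = jaccard(shingles(original), shingles(suspect))
sim_unrelated = jaccard(shingles(original), shingles(unrelated))
print(sim_suspect, sim_unrelated)
```

A document pair would be flagged when the similarity exceeds some threshold, which has to be tuned on known examples.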

You have data on all purchases of customers at a grocery
store. Describe to me how you would program an
algorithm that would cluster the customers into groups.
How would you determine the appropriate number of
clusters to include?
• K-means
• Choose a small value of k that still has a low SSE (elbow
method)
• https://bl.ocks.org/rpgove/0060ff3b656618e9136b
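
The elbow method can be sketched with a small k-means implementation, assuming NumPy. The two-column matrix stands in for per-customer features (for example, spend in two product categories), which is an assumption of the sketch; the deterministic farthest-point seeding is also a choice made here for reproducibility.

```python
import numpy as np


def farthest_point_init(X, k):
    """Deterministic seeding: start at X[0], then repeatedly take the point
    farthest from the centers chosen so far."""
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    return np.array(centers, dtype=float)


def kmeans_sse(X, k, n_iter=50):
    """Lloyd's algorithm; returns the within-cluster sum of squared errors."""
    centers = farthest_point_init(X, k)
    for _ in range(n_iter):
        # Assign each point to its nearest center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its points.
        for j in range(k):
            if np.any(labels == j):  # leave an empty cluster's center in place
                centers[j] = X[labels == j].mean(axis=0)
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float(dists.min(axis=1).sum())


# Hypothetical customers drawn from three well-separated groups.
rng = np.random.default_rng(0)
group_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(size=(50, 2)) for c in group_centers])

sse = {k: kmeans_sse(X, k) for k in range(1, 7)}
for k, value in sse.items():
    print(k, round(value, 1))
```

The SSE falls sharply up to k = 3 and only slowly afterwards; that bend is the "elbow" used to pick k.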

Let's say you're building the music recommendation engine
at Spotify to recommend people music based on past
listening history. How would you approach this problem?
• Collaborative filtering: recommend tracks that listeners
with similar histories have played.
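
A toy user-based collaborative filter, using only the standard library. The users, tracks, and play counts are invented, and cosine similarity is one common choice among several; this is a sketch, not a description of any production system.

```python
import math

# plays[user][track] = listen count (hypothetical data).
plays = {
    "ana": {"track_a": 5, "track_b": 3, "track_c": 4},
    "ben": {"track_a": 4, "track_b": 2, "track_c": 5, "track_d": 6},
    "cara": {"track_e": 7, "track_f": 2},
}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def recommend(user, plays, top_n=3):
    """Score unheard tracks by similarity-weighted plays of other users."""
    scores = {}
    for other, history in plays.items():
        if other == user:
            continue
        sim = cosine(plays[user], history)
        if sim <= 0:
            continue  # users with no overlap contribute nothing
        for track, count in history.items():
            if track not in plays[user]:
                scores[track] = scores.get(track, 0.0) + sim * count
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]


suggestions = recommend("ana", plays)
print(suggestions)
```

Only the listener with overlapping history influences the result, so the track they played that "ana" has not heard is the one suggested.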

A certain metric is violating your expectations by going
down or up more than you expect. How would you try to
identify the cause of the change?
• Break the KPI down into its components and find
where the change is.
• Then break that component down further by channel, user
cluster, etc., and relate the change to any campaigns or
changes in user behavior in that segment.
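
The breakdown step can be sketched as a per-segment comparison of two periods. The channels, segments, and revenue figures below are hypothetical, and the helper name is made up for the example.

```python
# Revenue by (channel, user segment) for two periods (hypothetical numbers).
before = {
    ("organic", "new"): 120.0, ("organic", "returning"): 300.0,
    ("paid", "new"): 200.0, ("paid", "returning"): 180.0,
}
after = {
    ("organic", "new"): 118.0, ("organic", "returning"): 295.0,
    ("paid", "new"): 90.0, ("paid", "returning"): 175.0,
}


def contribution_to_change(before, after):
    """Absolute change per segment, largest movers first."""
    keys = before.keys() | after.keys()
    deltas = {k: after.get(k, 0.0) - before.get(k, 0.0) for k in keys}
    return sorted(deltas.items(), key=lambda kv: (-abs(kv[1]), kv[0]))


total_change = sum(after.values()) - sum(before.values())
ranked = contribution_to_change(before, after)
for segment, delta in ranked:
    print(segment, delta, f"{delta / total_change:.0%} of total change")
```

In this made-up case nearly all of the drop sits in one channel and segment, which is where campaigns or behavior changes would be checked first.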

You’re a restaurant and are approached by Groupon to run
a deal. What data would you ask from them in order to
determine whether or not to do the deal?
• For similar restaurants (they should define similarity):
average increase in revenue per coupon, average
increase in customers per coupon, and number of meals sold.
