120 DS-With Answer
1. (Given a Dataset) Analyze this dataset and give me a model that can predict this
response variable.
• Problem Determination -> Data Cleaning -> Feature Engineering -> Modeling
• Benchmark Models
o Linear Regression (Ridge or Lasso) for regression
o Logistic Regression for Classification
• Advanced Models
o Random Forest, Boosting Trees, and so on
Scikit-Learn, XGBoost, LightGBM, CatBoost
• Determine if the problem is classification or regression
• Plot and visualize the data.
• Start by fitting a simple model (multivariate regression, logistic regression), do
some feature engineering accordingly, and then try some complicated models.
Always split the dataset into train, validation, test dataset and use cross validation
to check their performance.
• Favor simple models that run quickly and you can easily explain.
• Mention cross validation as a means to evaluate the model.
2. What could be some issues if the distribution of the test data is significantly
different than the distribution of the training data?
• The model that has high training accuracy might have low test accuracy. Without
further knowledge, it is hard to know which dataset represents the population
data and thus the generalizability of the algorithm is hard to measure. This
should be mitigated by repeated splitting of train vs. test dataset (as in cross
validation).
• A change in data distribution between training and test is called dataset shift. If
the train and test data have different distributions, a classifier fit to the train
data will likely generalize poorly to the test data.
• This issue can be overcome by using a more general learning method.
• This can occur when:
o P(y|x) are the same but P(x) are different. (covariate shift)
o P(y|x) are different. (concept shift)
• The causes can be:
o Training samples are obtained in a biased way. (sample selection bias)
o Train is different from test because of temporal, spatial changes. (non-
stationary environments)
• Solution to covariate shift
o importance weighted cv
3. What are some ways I can make my model more robust to outliers?
4. What are some differences you would expect in a model that minimizes squared
error, versus a model that minimizes absolute error? In which cases would each
error metric be appropriate?
• MSE penalizes large errors quadratically, so it is more sensitive to outliers. MAE
is more robust in that sense, but it is harder to fit because the absolute value is
not differentiable at zero, so there is no closed-form solution. So when outliers
are a concern and the model is computationally easy to fit, we should use MAE, and
if that’s not the case, we should use MSE.
• MSE: smooth gradient that is easy to compute. MAE: fitting typically needs linear
programming or subgradient methods
• MAE more robust to outliers. If the consequences of large errors are great, use
MSE
• MSE corresponds to maximizing likelihood of Gaussian random variables
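A small sketch of the robustness point above, under the simplifying assumption that the "model" is a single constant prediction. In that case the constant that minimizes squared error is the mean and the constant that minimizes absolute error is the median. The data values are made up for illustration.

```python
import statistics

def mse(values, c):
    """Mean squared error of predicting the constant c."""
    return sum((v - c) ** 2 for v in values) / len(values)

def mae(values, c):
    """Mean absolute error of predicting the constant c."""
    return sum(abs(v - c) for v in values) / len(values)

clean = [1.0, 2.0, 3.0, 4.0, 5.0]
with_outlier = clean[:-1] + [500.0]   # replace the largest value with an outlier

# Best constant under each loss: mean for MSE, median for MAE.
mean_clean, median_clean = statistics.mean(clean), statistics.median(clean)
mean_out, median_out = statistics.mean(with_outlier), statistics.median(with_outlier)

print(mean_clean, median_clean)   # both 3.0 on the clean data
print(mean_out, median_out)       # the mean jumps to 102.0, the median stays at 3.0
```

One outlier drags the squared-error optimum far from the bulk of the data, while the absolute-error optimum does not move.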
5. What error metric would you use to evaluate how good a binary classifier is?
What if the classes are imbalanced? What if there are more than 2 groups?
6. What are various ways to predict a binary response variable? Can you compare
two of them and tell me when one would be more appropriate? What’s the
difference between these? (SVM, Logistic Regression, Naive Bayes, Decision Tree,
etc.)
9. Given training data on tweets and their retweets, how would you predict the
number of retweets of a given tweet after 7 days after only observing 2 days
worth of data?
• Build a time series model with the training data with a seven day cycle and then
use that for a new data with only 2 days data.
• Ask someone for more details.
• Build a regression function to estimate the number of retweets as a function of
time t
• to determine if one regression function can be built, see if there are clusters in
terms of the trends in the number of retweets
• if not, we have to add features to the regression function
• features + # of retweets on the first and the second day -> predict the seventh
day
• https://en.wikipedia.org/wiki/Dynamic_time_warping
10. How could you collect and analyze data to use social media to predict the
weather?
• We can collect social media data using the Twitter, Facebook, and Instagram APIs.
• Then, for example, for Twitter, we can construct features from each tweet, e.g. the
tweeted date, number of favorites, retweets, and of course, the features created
from the tweeted content itself.
• Then use a multivariate time series model to predict the weather.
• Ask someone for more details.
11. How would you construct a feed to show relevant content for a site that
involves user interactions with items?
12. How would you design the people you may know feature on LinkedIn or
Facebook?
13. How would you predict who someone may want to send a Snapchat or Gmail
to?
• for each user, assign a score of how likely someone would send an email to
• the rest is feature engineering:
o number of past emails, how many responses, the last time they exchanged
an email, whether the last email ends with a question mark, features about
the other users, etc.
• Ask someone for more details.
• People who someone sent emails the most in the past, conditioning on time
decay.
14. How would you suggest to a franchise where to open a new store?
• build a master dataset with local demographic information available for each
location.
o local income levels, proximity to traffic, weather, population density,
proximity to other businesses
o a reference dataset on local, regional, and national macroeconomic
conditions (e.g. unemployment, inflation, prime interest rate, etc.)
o any data on the local franchise owner-operators, to the degree the
manager
• identify a set of KPIs acceptable to the management that had requested the
analysis concerning the most desirable factors surrounding a franchise
o quarterly operating profit, ROI, EVA, pay-down rate, etc.
• run econometric models to understand the relative significance of each variable
• run machine learning algorithms to predict the performance of each location
candidate
15. In a search engine, given partial data on what the user has typed, how would
you predict the user’s eventual search query?
16. Given a database of all previous alumni donations to your university, how
would you predict which recent alumni are most likely to donate?
17. You’re Uber and you want to design a heatmap to recommend to drivers where
to wait for a passenger. How would you approach this?
• Based on the past pickup location of passengers around the same time of the
day, day of the week (month, year), construct
• Ask someone for more details.
• Based on the number of past pickups
o account for periodicity (seasonal, monthly, weekly, daily, hourly)
o special events (concerts, festivals, etc.) from tweets
18. How would you build a model to predict a March Madness bracket?
• One vector each for team A and B. Take the difference of the two vectors and use
that as an input to predict the probability that team A would win by training the
model. Train the models using past tournament data and make a prediction for
the new tournament by running the trained model for each round of the
tournament
• Some extensions:
o Experiment with different ways of consolidating the 2 team vectors into
one (e.g. concatenating, averaging, etc.)
o Consider using a RNN type model that looks at time series data.
19. You want to run a regression to predict the probability of a flight delay, but
there are flights with delays of up to 12 hours that are really messing up your
model. How can you address this?
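def n_choose_k(n, k):
    # Return every length-n list of 0s and 1s that contains exactly k ones.
    # The function header and the k == 0 base case are reconstructed; the
    # recursion below does not terminate without that base case.
    if k == 0:
        return [[0] * n]
    if k == n:
        return [[1] * n]
    ans = []
    space = n - k + 1
    for i in range(space):
        # Place the first 1 at index i, then fill the remaining n - i - 1 slots.
        assignment = [0] * (i + 1)
        assignment[i] = 1
        for c in n_choose_k(n - i - 1, k - 1):
            ans.append(assignment + c)
    return ans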
• Store all the hashtags in a dictionary and use priority queue to solve the top-k
problem
• An extension will be top-k problem using Hadoop/MapReduce
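A minimal sketch of the dictionary plus priority queue idea above. It assumes tweets arrive as plain strings and that a hashtag is any whitespace-separated token starting with "#"; the function name `top_k_hashtags` is made up for illustration.

```python
import heapq
from collections import Counter

def top_k_hashtags(tweets, k):
    """Return the k most frequent hashtags as (hashtag, count) pairs."""
    counts = Counter()
    for tweet in tweets:
        for token in tweet.split():
            if token.startswith("#") and len(token) > 1:
                counts[token.lower()] += 1
    # heapq.nlargest keeps a heap of size k, so this is O(m log k)
    # for m distinct hashtags instead of a full O(m log m) sort.
    return heapq.nlargest(k, counts.items(), key=lambda item: (item[1], item[0]))

tweets = ["#ml is fun #python", "#Python rocks", "learning #ml #python"]
print(top_k_hashtags(tweets, 2))
```

Ties are broken by the hashtag text here, which is an arbitrary choice to keep the output deterministic.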
• https://en.wikipedia.org/wiki/Knapsack_problem
• Greedy solution (add the best v/w as much as possible and move on to the next)
• Dynamic programming
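The two approaches listed above can be sketched as follows. This assumes the 0/1 variant of the knapsack problem (each item used at most once) with positive integer weights; the item list is a standard textbook example, not data from the question.

```python
def knapsack_dp(items, capacity):
    """Exact 0/1 knapsack. items is a list of (value, weight), weights are positive ints."""
    best = [0] * (capacity + 1)          # best[c] = max value with total weight <= c
    for value, weight in items:
        # Go through capacities downwards so each item is used at most once.
        for cap in range(capacity, weight - 1, -1):
            best[cap] = max(best[cap], best[cap - weight] + value)
    return best[capacity]

def knapsack_greedy(items, capacity):
    """Heuristic: take items in decreasing value/weight order while they still fit."""
    total = 0
    for value, weight in sorted(items, key=lambda it: it[0] / it[1], reverse=True):
        if weight <= capacity:
            capacity -= weight
            total += value
    return total

items = [(60, 10), (100, 20), (120, 30)]
print(knapsack_dp(items, 50))       # 220 (items 2 and 3)
print(knapsack_greedy(items, 50))   # 160 (items 1 and 2, then item 3 no longer fits)
```

The greedy version is fast but only approximate for the 0/1 variant, which is why the dynamic programming solution is usually what an interviewer is after.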
4. Program an algorithm to find the best approximate solution to the traveling
salesman problem in a given time.
• https://en.wikipedia.org/wiki/Travelling_salesman_problem
• Greedy
• Dynamic programming
5. You have a stream of data coming in of size n, but you don’t know what n is
ahead of time. Write an algorithm that will take a random sample of k elements.
Can you write one that takes O(k) space?
• Reservoir sampling
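A minimal sketch of reservoir sampling (the classic "Algorithm R"), which uses O(k) space and makes one pass over the stream. The `rng` parameter is an addition for reproducibility, not part of the question.

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)       # fill the reservoir with the first k items
        else:
            j = rng.randint(0, i)        # uniform over 0..i, inclusive
            if j < k:
                reservoir[j] = item      # keep the new item with probability k / (i + 1)
    return reservoir

sample = reservoir_sample(iter(range(1000)), 5, random.Random(0))
print(sample)
```

If the stream has fewer than k elements, the function simply returns all of them.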
8. When can parallelism make your algorithms run faster? When could it make
your algorithms run slower?
9. What are the different types of joins? What are the differences between them?
• Inner Join, Left Join, Right Join, Outer Join, Self Join
10. Why might a join on a subquery be slow? How might you speed it up?
11. Describe the difference between primary keys and foreign keys in a SQL
database.
• Primary keys are columns whose value combinations must be unique in a specific
table so that each row can be referenced uniquely.
• Foreign keys are columns that reference columns (often primary keys) in other
tables.
SELECT f.faculty_name
FROM COURSES c
JOIN COURSE_FACULTY cf
ON c.course_id = cf.course_id
JOIN FACULTY f
ON f.faculty_id = cf.faculty_id
WHERE c.course_name = xxx;
13. Given an IMPRESSIONS table with ad_id, click (an indicator that the ad was
clicked), and date, write a SQL query that will tell me the click-through-rate of
each ad by month.
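One possible query, run here through Python's built-in sqlite3 module so it can be tried directly. This is a sketch under several assumptions: click is stored as 0/1, date is stored as ISO-8601 text, and the month is extracted with strftime, which is SQLite-specific (other databases have their own date functions). The sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE IMPRESSIONS (ad_id INTEGER, click INTEGER, date TEXT)")
conn.executemany(
    "INSERT INTO IMPRESSIONS VALUES (?, ?, ?)",
    [
        (1, 1, "2024-01-05"), (1, 0, "2024-01-20"),
        (1, 0, "2024-02-02"), (2, 1, "2024-01-11"),
    ],
)

# Click-through rate = clicks / impressions; with a 0/1 indicator that is AVG(click).
query = """
SELECT ad_id,
       strftime('%Y-%m', date) AS month,
       AVG(click)              AS ctr
FROM IMPRESSIONS
GROUP BY ad_id, strftime('%Y-%m', date)
ORDER BY ad_id, month;
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```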
14. Write a query that returns the name of each department and a count of the
number of employees in each:
1. Bobo the amoeba has a 25%, 25%, and 50% chance of producing 0, 1, or 2
offspring, respectively. Each of Bobo’s descendants also has the same probabilities.
What is the probability that Bobo’s lineage dies out?
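One way to approach this: the extinction probability q must satisfy q = 0.25 + 0.25q + 0.5q^2 (Bobo's line dies out if each of his offspring's lines dies out), which factors as (2q - 1)(q - 1) = 0. The extinction probability is the smallest non-negative root, q = 1/2. The sketch below checks this numerically by iterating the same equation from q = 0; the function name is made up.

```python
def extinction_probability(offspring_probs, iterations=200):
    """Iterate q <- G(q) from q = 0, where G(q) = sum_k P(k offspring) * q**k.

    The iteration converges to the smallest non-negative fixed point of G,
    which is the extinction probability of the branching process.
    """
    q = 0.0
    for _ in range(iterations):
        q = sum(p * q ** k for k, p in enumerate(offspring_probs))
    return q

q = extinction_probability([0.25, 0.25, 0.5])
print(round(q, 6))   # 0.5
```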
2. In any 15-minute interval, there is a 20% probability that you will see at least
one shooting star. What is the probability that you see at least one shooting star in
the period of an hour?
• 1-(0.8)^4 = 0.5904
• Or, we can use Poisson processes
3. How can you generate a random number between 1 - 7 with only a die?
• Quora Answer
4. How can you get a fair coin toss if someone hands you a coin that is weighted to
come up heads more often than tails?
• Flip twice:
o HT --> H
o TH --> T
• If HH or TT, repeat.
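The procedure above works because HT and TH both have probability p(1 - p), whatever the bias p is. A quick simulation sketch, assuming a coin that lands heads 80% of the time (the bias value and the seed are arbitrary):

```python
import random

def fair_flip(biased_flip):
    """Return 'H' or 'T' with probability 1/2 each, given any biased coin function."""
    while True:
        first, second = biased_flip(), biased_flip()
        if first != second:
            return first   # HT -> H, TH -> T; HH and TT are discarded and we retry

rng = random.Random(42)

def biased():
    """Hypothetical biased coin: heads with probability 0.8."""
    return "H" if rng.random() < 0.8 else "T"

flips = [fair_flip(biased) for _ in range(20000)]
share_heads = flips.count("H") / len(flips)
print(share_heads)   # close to 0.5
```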
5. You have a 50-50 mixture of two normal distributions with the same standard
deviation. How far apart do the means need to be in order for this distribution to
be bimodal?
6. Given draws from a normal distribution with known parameters, how can you
simulate draws from a uniform distribution?
8. You have a group of couples that decide to have children until they have their
first girl, after which they stop having children. What is the expected gender ratio
of the children that are born? What is the expected number of children each
couple will have?
• the outcome follows a multinomial distribution with n=12 and k=3. but the
classes are indistinguishable
• C(12, 4) * C(8, 4) * C(4, 4) / 3!
• 12! / (4!)^3 / 3!
10. Your hash function assigns each object to a number between 1:10, each with
equal probability. With 10 objects, what is the probability of a hash collision?
What is the expected number of hash collisions? What is the expected number of
hashes that are unused.
11. You call 2 UberX’s and 3 Lyfts. If the time that each takes to reach you is IID,
what is the probability that all the Lyfts arrive first? What is the probability that all
the UberX’s arrive first?
• 100+60-20=140
13. On a dating site, users can select 5 out of 24 adjectives to describe themselves.
A match is declared between two users if they match on at least 4 adjectives. If
Alice and Bob randomly pick adjectives, what is the probability that they form a
match?
• 24C5*(1+5*(24-5))/(24C5*24C5) = 96/24C5 = 4/1771
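The count can be checked with exact arithmetic. Fixing Alice's 5 adjectives, Bob matches if he shares exactly 4 of them (5 choices of which 4, times 19 choices for the remaining adjective) or all 5 (1 way). The function name and parameters below are made up for illustration.

```python
from fractions import Fraction
from math import comb

def match_probability(total=24, picks=5, needed=4):
    """P(two random selections of `picks` out of `total` share at least `needed` items)."""
    favorable = sum(
        comb(picks, shared) * comb(total - picks, picks - shared)
        for shared in range(needed, picks + 1)
    )
    return Fraction(favorable, comb(total, picks))

print(match_probability())   # 4/1771
```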
14. A lazy high school senior types up application and envelopes to n different
colleges, but puts the applications randomly into the envelopes. What is the
expected number of applications that went to the right college?
• 1
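The answer follows from linearity of expectation: each application lands in its own envelope with probability 1/n, and there are n applications, so the expected count is n * (1/n) = 1 for every n. A brute-force check over all permutations for small n (the function name is made up):

```python
from fractions import Fraction
from itertools import permutations

def expected_matches(n):
    """Average number of fixed points over all n! permutations."""
    perms = list(permutations(range(n)))
    total = sum(sum(1 for i, p in enumerate(perm) if i == p) for perm in perms)
    return Fraction(total, len(perms))

for n in range(1, 7):
    print(n, expected_matches(n))   # the expectation is 1 for every n
```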
15. Let’s say you have a very tall father. On average, what would you expect the
height of his son to be? Taller, equal, or shorter? What if you had a very short
father?
16. What’s the expected number of coin flips until you get two heads in a row?
What’s the expected number of coin flips until you get two tails in a row?
17. Let’s say we play a game where I keep flipping a coin until I get heads. If the
first time I get heads is on the nth coin, then I pay you 2n-1 dollars. How much
would you pay me to play this game?
• less than $3
• Quora reference
18. You have two coins, one of which is fair and comes up heads with a probability
1/2, and the other which is biased and comes up heads with probability 3/4. You
randomly pick a coin and flip it twice, and get heads both times. What is the
probability that you picked the fair coin?
• 4/13
• Bayesian method
19. You have a 0.1% chance of picking up a coin with both heads, and a 99.9%
chance that you pick up a fair coin. You flip your coin and it comes up heads 10
times. What’s the chance that you picked up the fair coin, given the information
that you observed?
• Bayesian method
• https://en.wikipedia.org/wiki/P-value
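The "Bayesian method" in questions 18 and 19 is a direct application of Bayes' rule: P(fair | data) = P(data | fair) P(fair) / P(data). A sketch with exact fractions; the helper name `posterior_fair` is made up.

```python
from fractions import Fraction

def posterior_fair(prior_fair, p_heads_other, n_heads):
    """P(fair coin | n_heads heads in a row), where the other coin has P(heads) = p_heads_other."""
    fair = prior_fair * Fraction(1, 2) ** n_heads
    other = (1 - prior_fair) * p_heads_other ** n_heads
    return fair / (fair + other)

# Question 18: fair vs. 3/4-biased coin, equal priors, two heads observed.
q18 = posterior_fair(Fraction(1, 2), Fraction(3, 4), 2)
# Question 19: 99.9% fair vs. 0.1% double-headed coin, ten heads observed.
q19 = posterior_fair(Fraction(999, 1000), Fraction(1), 10)

print(q18)                # 4/13
print(q19, float(q19))    # 999/2023, about 0.494
```

So even with a 99.9% prior on the fair coin, ten heads in a row leave the two hypotheses roughly equally likely.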
Statistical Inference (15 questions)
1. In an A/B test, how can you check if assignment to the various buckets was truly
random?
• Plot the distributions of multiple features for both A and B and make sure that
they have the same shape. More rigorously, we can conduct a permutation test to
see if the distributions are the same.
• MANOVA to compare different means
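A minimal sketch of the permutation test mentioned above, applied to one pre-experiment feature at a time. It assumes the test statistic is the absolute difference in means; the function name, default of 2000 shuffles, and sample data are illustrative choices.

```python
import random

def permutation_test(a, b, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference in means; returns a p-value."""
    rng = random.Random(seed)
    mean = lambda xs: sum(xs) / len(xs)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)                      # random relabeling of the two buckets
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            extreme += 1
    # The +1 terms count the observed labeling itself, so p is never exactly 0.
    return (extreme + 1) / (n_permutations + 1)

bucket_a = [1, 2, 3, 4, 5] * 4
bucket_b = [x + 10 for x in bucket_a]             # clearly different feature values
print(permutation_test(bucket_a, bucket_a))       # 1.0: no evidence of imbalance
print(permutation_test(bucket_a, bucket_b))       # very small: assignment looks non-random
```

A small p-value on a feature that the experiment could not have influenced (e.g. account age) suggests the bucket assignment was not random.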
2. What might be the benefits of running an A/A test, where you have two buckets
who are exposed to the exact same product?
3. What would be the hazards of letting users sneak a peek at the other bucket in
an A/B test?
• Users might not act the same as they would have had they not seen the other
bucket. You are essentially adding an additional variable, whether the user peeked
at the other bucket, which is not random across groups.
4. What would be some issues if blogs decide to cover one of your experimental
groups?
• Same as the previous question. The above problem can happen in larger scale.
6. How would you run an A/B test for many variants, say 20 or more?
• one control, 20 treatment, if the sample size for each group is big enough.
• Ways to attempt to correct for this include changing your confidence level (e.g.
Bonferroni Correction) or doing family-wide tests before you dive in to the
individual metrics (e.g. Fisher's Protected LSD).
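A short sketch of the Bonferroni correction mentioned above: with m comparisons against the control, each one is tested at level alpha / m so that the family-wise error rate stays at most alpha. The p-values below are made up.

```python
def bonferroni(p_values, alpha=0.05):
    """Return, for each p-value, whether it is significant after Bonferroni correction."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]

# 20 treatment-vs-control comparisons: the per-test threshold is 0.05 / 20 = 0.0025.
p_values = [0.001, 0.02, 0.04] + [0.5] * 17
significant = bonferroni(p_values)
print(significant.count(True))   # 1: only p = 0.001 survives the correction
```

Without the correction, three variants would look significant at the 5% level, which is about what pure chance can produce across 20 tests.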
7. How would you run an A/B test if the observations are extremely right-skewed?
• lower the variability by modifying the KPI
• cap values
• percentile metrics
• log transform
• https://www.quora.com/How-would-you-run-an-A-B-test-if-the-observations-
are-extremely-right-skewed
8. I have two different experiments that both change the sign-up button to my
website. I want to test them at the same time. What kinds of things should I keep
in mind?
• If the two experiments are mutually exclusive (each user is exposed to at most
one of them) -> ok
9. What is a p-value? What is the difference between type-1 and type-2 error?
• en.wikipedia.org/wiki/P-value
• type-1 error: rejecting H0 when H0 is true
• type-2 error: not rejecting H0 when Ha is true
10. You are AirBnB and you want to test the hypothesis that a greater number of
photographs increases the chances that a buyer selects the listing. How would you
test this hypothesis?
• For randomly selected listings with more than 1 picture, hide 1 random picture
for group A, and show all for group B. Compare the booking rate for the two
groups.
• Ask someone for more details.
11. How would you design an experiment to determine the impact of latency on
user engagement?
• The best way I know to quantify the impact of performance is to isolate just that
factor using a slowdown experiment, i.e., add a delay in an A/B test.
12. What is maximum likelihood estimation? Could there be any case where it
doesn’t exist?
• A method for parameter estimation (fitting a model). We choose the parameters that
maximize the likelihood function (how probable the observed data are under the
model with those parameters).
• maximum likelihood estimation (MLE) is a method
of estimating the parameters of a statistical model given observations, by finding
the parameter values that maximize the likelihood of making the observations
given the parameters. MLE can be seen as a special case of the maximum a
posteriori estimation (MAP) that assumes a uniform prior distribution of the
parameters, or as a variant of the MAP that ignores the prior and which therefore
is unregularized.
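• It may not exist: for Gaussian mixtures the likelihood is unbounded (a component
can collapse onto a single data point), and the same can happen in non-parametric
models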
13. What’s the difference between a MAP, MOM, MLE estimator? In which cases
would you want to use each?
• MAP picks the parameter value that maximizes the posterior, which is proportional
to the likelihood times the prior. MLE is a special case of MAP where the prior is
an uninformative uniform distribution.
• MOM sets moment values and solves for the parameters. MOM is not used much
anymore because maximum likelihood estimators have higher probability of
being close to the quantities to be estimated and are more often unbiased.
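A small worked example of the three estimators, assuming the simplest possible setting: estimating a coin's probability of heads from n flips, with a Beta(a, b) prior for MAP. In this setting MOM and MLE coincide (both equal the sample mean), and MAP with a uniform Beta(1, 1) prior reduces to MLE. Function names are made up.

```python
from fractions import Fraction

def mle(heads, n):
    """Maximum likelihood estimate of P(heads); also the method-of-moments estimate."""
    return Fraction(heads, n)

def map_estimate(heads, n, a, b):
    """Posterior mode under a Beta(a, b) prior (valid when heads + a > 1 and n - heads + b > 1)."""
    return Fraction(heads + a - 1, n + a + b - 2)

heads, n = 9, 10
print(mle(heads, n))                   # 9/10
print(map_estimate(heads, n, 1, 1))    # 9/10: a uniform prior gives back the MLE
print(map_estimate(heads, n, 5, 5))    # 13/18: the prior pulls the estimate toward 1/2
```

With little data, an informative prior regularizes the estimate; as n grows, MAP and MLE converge.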
• For example, 95% confidence interval is an interval that when constructed for a
set of samples each sampled in the same way, the constructed intervals include
the true mean 95% of the time.
• if confidence intervals are constructed using a given confidence level in an infinite
number of independent experiments, the proportion of those intervals that
contain the true value of the parameter will match the confidence level.
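The long-run interpretation above can be checked by simulation. The sketch below assumes normally distributed data and uses the normal-approximation interval (mean plus or minus 1.96 standard errors); the population values, sample size, and seed are arbitrary.

```python
import random
import statistics

def coverage(true_mean=70.0, sd=3.0, n=50, trials=2000, seed=1):
    """Fraction of simulated 95% confidence intervals that contain the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sd) for _ in range(n)]
        m = statistics.mean(sample)
        half_width = 1.96 * statistics.stdev(sample) / n ** 0.5
        if m - half_width <= true_mean <= m + half_width:
            hits += 1
    return hits / trials

print(coverage())   # close to 0.95
```

Each individual interval either contains the true mean or does not; the 95% describes the procedure across repeated samples.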
1. (Given a Dataset) Analyze this dataset and tell me what you can learn from it.
2. What is R2? What are some other metrics that could be better than R2 and why?
• Statistically
o It depends on the quality of your data, for example, if your data is biased,
just getting more data won’t help.
o It depends on your model. If your model suffers from high bias, getting
more data won’t improve your test results beyond a point. You’d need to
add more features, etc.
• Practically
o More data usually benefit the models
o Also there’s a tradeoff between having more data and the additional
storage, computational power, memory it requires. Hence, always think
about the cost of having more data.
5. What are advantages of plotting your data before performing analysis?
• Data sets have errors. You won't find them all but you might find some. That 212
year old man. That 9 foot tall woman.
• Variables can have skewness, outliers, etc. Then the arithmetic mean might not be
useful, which means the standard deviation isn't useful.
• Variables can be multimodal! If a variable is multimodal then anything based on
its mean or median is going to be suspect.
6. How can you make sure that you don’t analyze something that ends up
meaningless?
7. What is the role of trial and error in data analysis? What is the role of
making a hypothesis before diving in?
8. How can you determine which features are the most important in your model?
10. You have several variables that are positively correlated with your response,
and you think combining all of the variables could give you a good prediction of
your response. However, you see that in the multiple linear regression, one of the
weights on the predictors is negative. What could be the issue?
• PCA
12. Now you have a feasible amount of predictors, but you’re fairly sure that you
don’t need all of them. How would you perform feature selection on the dataset?
13. Your linear regression didn’t run and communicates that there are an infinite
number of best estimates for the regression coefficients. What could be wrong?
• p > n.
• If some of the explanatory variables are perfectly correlated (positively or
negatively) then the coefficients would not be unique.
14. You run your regression on different subsets of your data, and find that in each
subset, the beta value for a certain variable varies wildly. What could be the issue
here?
15. What is the main idea behind ensemble learning? If I had many different
models that predicted the same response variable, what might I want to do to
incorporate all of the models? Would you expect this to perform better than an
individual model or worse?
For example, if you're doing binary classification, you can use all the probability outputs
of your individual models as inputs to a final logistic regression (or any model, really)
that can combine the probability estimates.
One very important point is to make sure that the output of your models are out-of-
sample predictions. This means that the predicted value for any row in your data-frame
should NOT depend on the actual value for that row.
16. Given that you have wifi data in your office, how would you determine which
rooms and areas are underutilized and over-utilized?
• If the data is more used in one room, then that one is over utilized!
• Maybe account for the room capacity and normalize the data.
17. How could you use GPS data from a car to determine the quality of a driver?
• Speed
• Driving paths
18. Given accelerometer, altitude, and fuel usage data from a car, how would you
determine the optimum acceleration pattern to drive over hills?
• Historical data?
19. Given position data of NBA players in a season’s games, how would you
evaluate a basketball player’s defensive ability?
21. Given location data of golf balls in games, how would you construct a model that
can advise golfers where to aim?
22. You have 100 mathletes and 100 math problems. Each mathlete gets to choose
10 problems to solve. Given data on who got what problem correct, how would
you rank the problems in terms of difficulty?
• One way you could do this is by storing a "skill level" for each user and a
"difficulty level" for each problem. We assume that the probability that a user
solves a problem only depends on the skill of the user and the difficulty of the
problem.* Then we maximize the likelihood of the data to find the hidden skill
and difficulty levels.
• The Rasch model for dichotomous data takes the form:
\Pr\{X_{ni}=1\} = \frac{\exp(\beta_n - \delta_i)}{1 + \exp(\beta_n - \delta_i)},
where \beta_n is the ability of person n and \delta_i is the difficulty of item i.
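A minimal sketch of the Rasch model in code: the probability that a person answers an item correctly depends only on ability minus difficulty, and the log-likelihood is what one would maximize to recover the hidden abilities and difficulties. The data layout (a list of (person, item, correct) triples) and function names are assumptions for illustration; the optimizer itself is omitted.

```python
import math

def rasch_probability(ability, difficulty):
    """P(correct) = exp(b - d) / (1 + exp(b - d)), written in its logistic form."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def log_likelihood(results, abilities, difficulties):
    """results: iterable of (person, item, correct) with correct in {0, 1}."""
    total = 0.0
    for person, item, correct in results:
        p = rasch_probability(abilities[person], difficulties[item])
        total += math.log(p if correct else 1.0 - p)
    return total

print(rasch_probability(0.0, 0.0))   # 0.5 when ability equals difficulty
print(rasch_probability(2.0, 0.0))   # above 0.5 for a stronger mathlete
```

Ranking the problems by their fitted difficulty then accounts for the fact that stronger mathletes may have chosen the harder problems.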
23. You have 5000 people that rank 10 sushis in terms of saltiness. How would you
aggregate this data to estimate the true saltiness rank in each sushi?
• Some people would take the mean rank of each sushi. If I wanted something
simple, I would use the median, since ranks are (strictly speaking) ordinal and not
interval, so adding them is a bit risque (but people do it all the time and you
probably won't be far wrong).
24. Given data on congressional bills and which congressional representatives co-
sponsored the bills, how would you determine which other representatives are
most similar to yours in voting behavior? How would you evaluate who is the most
liberal? Most republican? Most bipartisan?
• collaborative filtering. you have your votes and we can calculate the similarity for
each representatives and select the most similar representative
• for liberal and republican parties, find the mean vector and find the
representative closest to the center point
25. How would you come up with an algorithm to detect plagiarism in online
content?
• reduce the text to a more compact form (e.g. fingerprinting, bag of words) then
compare those with other texts by calculating the similarity
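One simple instance of the "reduce, then compare" idea above is word shingling with Jaccard similarity. This is a sketch only: real systems typically hash the shingles and index them, and the shingle size of 3 is an arbitrary choice.

```python
def shingles(text, size=3):
    """Set of overlapping word n-grams (a crude fingerprint of the text)."""
    words = text.lower().split()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def jaccard(a, b):
    """Size of the intersection divided by size of the union (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

original = "the quick brown fox jumps over the lazy dog"
suspect = "the quick brown fox leaps over the lazy dog"
similarity = jaccard(shingles(original), shingles(suspect))
print(similarity)   # 0.4: 4 shared shingles out of 10 distinct ones
```

Documents whose similarity exceeds some threshold would then be flagged for closer inspection.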
26. You have data on all purchases of customers at a grocery store. Describe to me
how you would program an algorithm that would cluster the customers into
groups. How would you determine the appropriate number of clusters to include?
• K-means
• choose a small value of k that still has a low SSE (elbow method)
27. Let’s say you’re building the recommended music engine at Spotify to
recommend people music based on past listening history. How would you
approach this problem?
• content-based filtering
• collaborative filtering
Product Metrics (15 questions)
• advertising-driven: Page-views and daily actives, CTR, CPC (cost per click)
o click-ads
o display-ads
• service-driven: number of purchases, conversion rate
4. What would be good metrics of success for a consumer product that relies
heavily on engagement and interaction? (Snapchat, Pinterest, Facebook, etc.) A
messaging product? (GroupMe, Hangouts, Snapchat, etc.)
• break down the KPIs into their components and find where the change is
• then further breakdown that basic KPI by channel, user cluster, etc. and relate
them with any campaigns, changes in user behaviors in that segment
7. Growth for total number of tweets sent has been slow this month. What data
would you look at to determine the cause of the problem?
8. You’re a restaurant and are approached by Groupon to run a deal. What data
would you ask from them in order to determine whether or not to do the deal?
• for similar restaurants (they should define similarity), average increase in revenue
gain per coupon, average increase in customers per coupon
9. You are tasked with improving the efficiency of a subway system. Where would
you start?
• define efficiency
10. Say you are working on Facebook News Feed. What would be some metrics
that you think are important? How would you make the news each person gets
more relevant?
• rate for each action, duration users stay, CTR for sponsor feed posts
• ref. News Feed Optimization
o Affinity score: how close the content creator and the users are
o Weight: weight for the edge type (comment, like, tag, etc.). Emphasis on
features the company wants to promote
o Time decay: the older the less important
11. How would you measure the impact that sponsored stories on Facebook News
Feed have on user engagement? How would you determine the optimum balance
between sponsored stories and organic content on a user’s News Feed?
12. You are on the data science team at Uber and you are asked to start thinking
about surge pricing. What would be the objectives of such a product and how
would you start looking into this?
13. Say that you are Netflix. How would you determine what original series you
should invest in and create?
• Netflix uses data to estimate the potential market size for an original series
before giving it the go-ahead.
14. What kind of services would find churn (metric that tracks how many
customers leave the service) helpful? How would you calculate churn?
15. Let’s say that you’re scheduling content for a content provider on
television. How would you determine the best times to schedule content?
• Data science
Suppose we are interested in some characteristic of a population; for example, the average
height h of all adult males in the U.S. We can estimate h by drawing a random sample of
adult males in the U.S. and calculating the average height H in the sample. This is called
a point estimate of h. If the sample is large, H will be a good estimate of h, but by itself it
does not tell you how good it is.
That is the definition. Now, a few comments. Why the strange word "confidence," which is
never used by itself in probability or statistics? Why the scare quotes around the word
"likely" in the previous paragraph?
Confidence intervals are a tool of the frequentist school of statistics, which holds that we
should use the concepts of probability and randomness only to describe the mechanics of
certain kinds of sampling from populations, and not to describe our certainty or degree of
belief. Frequentists aim to use probability in an objective way.
For a frequentist, a statement like "the probability that the average height h of all males
in the US lies between 70 and 74 inches is 95%" is meaningless: h is just a number we don't
know. It either lies in the interval (70, 74) or it doesn't.
Confidence intervals are a trick frequentists use to make statements resembling the one
above without violating their rules about how probability should be used. According to the
definition given above, it is legitimate to write:
P(L≤h≤U)=95%
if (L, U) is derived according to a rule so that it does contain h for 95% of samples. This
resembles a subjective statement about our certainty that h lies in the range (L, U). But it
isn't: it's an objective statement about how often, in the long run, our random interval will
contain the fixed but unknown h, according to the randomness in our sampling. The
"subject" of the probability statement above looks like it is h (and that is the trick) but
actually it is the interval.
The probability statement above makes sense only before we draw a sample. What
happens after we draw the sample, when we find L=70 and U=74? There is a strong
temptation to plug into the previous expression to get:
P(70≤h≤74)=95%
which is exactly the sentence held to be meaningless by frequentists earlier. h either lies
between the other two numbers or not; there is no probability involved.
Any inference we draw about h must of course happen after we draw the sample, but
frequentist rules prevent us from invoking probability at this point. So instead we refer back
to the randomness which gave us the interval (70, 74). It is not that this particular interval
contains h with 95% probability, but rather that an interval constructed in this way will
contain h 95% of the time.
Everyone who uses confidence intervals, including every frequentist statistician on the
planet, would actually interpret the interval (70, 74) as representing a "likely range"
for h, implicitly invoking something like the illegal probability
statement P(70≤h≤74)=95%. Without using some terms related to
probability, it is almost impossible to explain what useful relation the interval (70, 74) bears
to h.
Confidence intervals are useful mainly because we can misinterpret them as probability
statements about unknown parameters.
5. How would you explain to a group of senior executives why data is important?
• Examples