
— ACE THE —
DATA SCIENCE INTERVIEW

201 Real Interview Questions Asked By FAANG, Tech Startups, & Wall Street

KEVIN HUO (Ex-Facebook, Now Hedge Fund)
NICK SINGH (Ex-Facebook, Now Career Coach)
About Kevin Huo
Kevin Huo is currently a Data Scientist at a Hedge Fund, and previously was a Data Scientist at
Facebook working on Facebook Groups. He holds a degree in Computer Science from the University
of Pennsylvania and a degree in Business from Wharton. In college, he interned at Facebook,
Bloomberg, and on Wall Street. On the side, he's helped hundreds of people land data jobs at
companies including Apple, Lyft, and Citadel.

About Nick Singh


Nick Singh started his career as a Software Engineer on Facebook’s Growth Team, and most recently,
worked at SafeGraph, a geospatial analytics startup. He holds a degree in Systems Engineering with a
minor in Computer Science from the University of Virginia. In college, he interned at Microsoft and
on the Data Infrastructure team at Google's Nest Labs. His career advice has been read by over 10
million people on LinkedIn.

All rights reserved. No part of this book may be used or reproduced in any manner without written
permission except in the case of brief quotations in critical articles or reviews.

The authors and/or copyright holder assume no responsibility for the loss or damage caused or
allegedly caused, directly or indirectly, by the use of information contained in this book. The authors
specifically disclaim any liability incurred from the use or application of the contents of this book.
Throughout this book, trademarked names are referenced. Rather than using a trademark symbol
with every occurrence of a trademarked name, we state that we are using the names in an editorial
fashion only and to the benefit of the trademark owner, with no intention of infringement of the
trademark.

Copyright 2021 Ace the Data Science Interview.


All rights reserved.
ISBN 978-0-578-97383-8
Praise for Ace the Data Science Interview
"The advice in this book directly helped me land my dream job.”
—Advitya Gemawat, ML Engineer, Microsoft

"Super helpful career advice on breaking into data, and landing your first job in the field.”
— Prithika Hariharan, President of Waterloo Data Science Club
Data Science Intern, Wish

"Solving the 201 interview questions is helpful for people in ALL industries, not just tech!”
— Lars Hulstaert, Senior Data Scientist, Johnson & Johnson

"FINALLY! A book like Cracking the Coding Interview


but for Data Science & ML! “
— Jack Morris, Al Resident, Google

"The authors explain exactly what hiring managers look for – a must-read for any data job seeker."
— Michelle Scarbrough, Former Data Analytics Manager, F500 Co.

"Ace the Data Science Interview provides a comprehensive overview of


the information an academic needs to transition into industry. I highly
recommend this book to any graduate student looking to navigate
data science job interviews and prepare for the corporate sector.”
— Lindsay Warrenburg, PhD; Data Scientist, Sonde Health I-lead of Dala Science Projects at
The Erdös Institute

"Whal I found most compelling was the love story that unfolds through the book. From
the first date to the data science interview, Ace reveals his true character and what
follows is incredible. I 'm thrilled by their avant garde style that uses career
advice as a vehicle for fictional narrative, Once you pick up on it, you feel
as though you 're in on a secret even the authors weren’t aware of! "
—Naveen lyer, Former Machine Learning Engineer, Instagram

"Covers pretty much every topic I 've been tested on during data science
& analytics interviews.”
— Jeffrey Ugochukwu, Dala Analyst Intern, Federal Reserve; UC Davis Statistics' 23

"An invaluable resource for the Data Science & ML community.”


—Aishwarya Srinivasan, Al & ML Innovation Leader, IBM
"Highly recommend this for aspiring or current quantitative finance
professionals.”
—Alex Wang, Portfolio Analytics Analyst, BlackRock

"I strongly recommend this book to both data science


aspirants and professionals in the field.”
—Chirag Subramanian, Former Data Scientist, Amwins
Group
Georgia Tech, MS in Analytics' 23

"Perfectly covers the many real-world considerations which ML interview questions test for."
—Neha Pusarla, Data Science Intern, Regeneron; Columbia, MS in Data Science '21

"Amazing tips for creating portfolio projects, and then


leveraging them to ace behavioral
interviews.”
— Jordan Pierre, ML Engineer, Nationwide Insurance

"Nick, Kevin, and this book have been extremely helpful resources
as
I navigate my way into the world of data science.”
— Tania Dawood, USC MS in Communications Data Science '23

"Excellent practice to keep yourself sharp for Wall Street quant and
data science interviews! "
— Mayank Mahajan, Data Scientist, Blackstone

"The authors did an amazing job presenting the frameworks for solving
practical case study interview questions in simple, digestible terms.”
— Rayan Roy, University of Waterloo Statistics '23

"Navigates the intricacies of data science interviews without getting lost in them.”
— Sourabh Shekhar, Former Senior Data Scientist,
Neustar & American Express
For my family: Bin, Jing, Matt, and Allen

~ Kevin

For Mom and Dad, Priya and Dev; and my brother, Naman—
My family, who supports me in every endeavor

~ Nick
Table of Contents
Introduction

Career Advice to Ace the Data Science Job Hunt
Chapter 1: 4 Resume Principles to Live by for Data Scientists
Chapter 2: How to Make Kick-Ass Portfolio Projects
Chapter 3: Cold Email Your Way to Your Dream Job in Data
Chapter 4: Ace the Behavioral Interview

Ace the Technical Data Science Interview
Chapter 5: Probability
Chapter 6: Statistics
Chapter 7: Machine Learning
Chapter 8: SQL & DB Design
Chapter 9: Coding
Chapter 10: Product Sense
Chapter 11: Case Studies

Introduction
Data scientists are not only privileged to be solving some of the most intellectually stimulating and
impactful problems in the world – they’re getting paid very well for it too. At Facebook, the median
total compensation reported by Levels.fyi for a Senior Data Scientist is a whopping $253,000 per
year. According to data on Glassdoor, the median base salary for a Quantitative researcher at hedge
fund Two Sigma is $173,000 per year, with opportunities to DOUBLE take-home pay thanks to
generous performance bonuses.
Given how intellectually stimulating and downright lucrative data science is, it shouldn't be a surprise that competition for these top data jobs is fierce. Between "entry-level" positions in data science weirdly expecting multiple years of experience, and these entry-level jobs themselves being relatively rare in a field saturated with Ph.D. holders, early-career data scientists face hurdles in even landing interviews at many firms.
Worse, job seekers at all experience levels face obstacles with online applications, likely never hearing back from most jobs they apply to. Sometimes, this is due to an undiagnosed weakness in a data scientist's resume, causing recruiters to pass on talented candidates. But often, it's simply because candidates aren't able to stand out from the sea of candidates an online job application attracts. Forget about acing the data science interview: given the number of job-hunting challenges a data scientist faces, just getting an interview at a top firm can be considered an achievement in itself.
Then there's the question of actually passing the rigorous technical interviews. In an effort to minimize false positives (aka "bad hires"), top companies run everyone from interns to industry veterans through tough technical challenges to filter out weak candidates. These interviews cover a lot of topics because the data scientist role is itself so nebulous and varied: what one company calls a data scientist, another company might call a data analyst, data engineer, or machine learning engineer. Only after passing these onerous technical interviews, often three or four on the same day, can you land your dream job in data science.
We know this all must sound daunting. Spoiler alert: it is!

The good news is that in this book we teach you exactly how to navigate the data science job search
so that you can land more interviews in the first place. We've put together a shortlist of the most
essential topics to brush up on as you prepare for your interviews so that you can ace these tough technical questions. Most importantly, to put your technical skills to the test, we included 201 interview questions from real data scientist interviews. By solving actual problems from FAANG companies, Silicon Valley darlings like Airbnb and Robinhood, and Wall Street firms like Two Sigma
and Citadel, we're confident our book will prepare you to ace the data science interview and help
you land your dream job in data.
Who are we?
Who are we, and how'd we find ourselves writing this book?

I (Nick) have worked in various data-related roles. My first internship was at a defense contractor,
CCRi, where I did data science work for the U.S. Intelligence Community. Later in college, I interned
as a software engineer at Microsoft and at Google's Nest Labs, doing data infrastructure engineering.
After graduating from the University of Virginia with a degree in systems engineering, I started my
full-time career as a new grad software engineer on Facebook's Growth team. There, I implemented
features and ran A/B tests to boost new user retention.
After Facebook, I found myself hungry to learn the business side of data, so I joined geospatial analytics startup SafeGraph as their first marketing hire. There I helped data science and machine learning teams at Fortune 500 retailers, hedge funds, and ad-tech startups learn about SafeGraph's
location analytics datasets.
On the side, I started to write about my career journey, and all the lessons I learned from being both
a job applicant and an interviewer. Ten million views on LinkedIn later, it's obvious the advice struck
a nerve. From posting on LinkedIn, and sending emails to my tech careers newsletter with 45,000
subscribers, I've been privileged to meet and help thousands of technical folks along their career
journey. But, there was a BIG problem.
As a mentor, I was able to point software engineers and product managers to many resources for
interview prep and career guidance, like Cracking the Coding Interview, Cracking the PM Interview
and LeetCode. But, from helping data scientists, I realized just how confusing the data science
interview process was, and how little quality material was out there to help people land jobs in data.
I reached out to Kevin to see if he felt the same way.
You might be wondering...

Why'd I turn to Kevin?


For several reasons! We're longtime friends, having attended high school together in Northern Virginia at Thomas Jefferson High School for Science and Technology. Though we went our separate ways for college, we became close friends once again after becoming roommates in Palo Alto, California, when we both worked for Facebook as new grads, and bonded over our shared love of Drake.
By living with Kevin, I learned firsthand three things about him:
1. Kevin is an expert data scientist.
2. Kevin loves helping people.
3. Kevin is a fantastic freestyle rapper.

Because my rapping skills paled in comparison to Kevin's, and I can't sing worth a damn (even
though my last name is Singh), it made sense to delay the mixtape and instead focus on our other
shared passion: helping people navigate the data science job hunt.
Kevin has successfully landed multiple offers in the data world. It started in college, when he interned on the Ad Fraud team at Facebook. After graduating from the University of Pennsylvania with a major in computer science, and a degree in business from Wharton, Kevin started his career as a data scientist at Facebook, where he worked on reducing bad actors and harmful content on the Facebook Groups platform. After a year, Wall Street came calling. Kevin currently works as a data scientist at a hedge fund in New York.
On the side, Kevin combined his passion for data science and helping people, which led him to
found DataScienceprep.com, become a course creator on DataCamp, and coach dozens of people in
their data science careers.
Ace the Data Science Interview results from Kevin's and my experience working in Silicon Valley and on Wall Street, the insights we've garnered from networking with recruiters and data science managers, our personal experience coaching hundreds of data scientists to land their dream role, and our shared frustration with the career and interview guidance that's been available to data scientists. That is, until now!

What Exactly Do We Cover?


We start with a "behind the scenes" look at how recruiters and hiring managers at top companies
evaluate your resumes and portfolios so you can see what it takes to stand out from the rest. After
reviewing hundreds of resumes, we've seen technical folks make the same mistakes over and over
again. But not you, after you follow the advice in Chapter 1.
In Chapter 2, we show you how to make kick-ass portfolio projects. These projects will leap off the resume and will make any person reading your application want to interview you. A well-crafted
portfolio project will also help you ace the behavioral interview.
But how do we get folks to read your application in the first place?
In Chapter 3, we teach you how to cold-email your way to your dream job in data. We give you a
new way of finding a job so that you don't have to keep applying online and getting ghosted. By
getting to the top of the recruiter's and hiring manager's email inbox, you'll get noticed, start the
networking process early, and often get an inside scoop into the role.
Next comes Chapter 4: Ace the Behavioral Interview. While there's no one right answer to "tell me about yourself" or "do you have any questions for us," there are plenty of wrong ways to approach these conversations. Learn how to avoid these mistakes, craft a better personal story, and tailor your answers, so that the interviewer is left thinking you were born for the role.
Finally, we're ready for the trickiest part of the data science job hunt, and the meat of our book:
acing the technical data science interview. Chapters 5-10 give you an overview of the common
technical subjects asked during data science interviews. We detail what to brush up on and what to
skip, for topics within Probability, Statistics, Machine Learning, SQL & Database Design, Coding, and
Product Sense. In Chapter 11 — the boss chapter — we cover how to approach open-ended case
questions, which blend multiple topics into one big problem.
Each of these technical chapters also tests your knowledge with real interview questions asked by
tech behemoths like Facebook, Google, Amazon, and Microsoft, mid-sized tech companies like Stripe, Robinhood, and Palantir, and Wall Street's biggest banks and funds like Goldman Sachs, Two
Sigma, and Citadel. With problems organized into easy, medium, and hard difficulty levels, there is
something to learn for everyone from the data science neophyte all the way to a Kaggle champion. If
you get stuck, there's nothing to fear, as each problem has a fully worked-out solution too.
Additional Resources to Accompany the Book
Alongside reading this book, you are 94.6% encouraged to join our Instagram community at
instagram.com/acedatascienceinterviews for additional career tips, interview problems, memes, and
the chance to see our glowing faces from time to time when we flex on the gram.


Also, make sure you've subscribed to Nick's monthly career advice email newsletter:
nicksingh.com/signup
It's just one email a month with the latest tips, resources, and guides to help you excel in your
technical career.
And speaking of email, if you have suggestions, find any mistakes, have success stories to share, or
just want to say hello, send us an email: [email protected] or feel free to
connect with us on social media.

Nick Singh
nicksingh.com — where I blog my long-form essays and career guides
Linkedin.com/in/Nipun-Singh — where I share career advice daily (please send a connection
request with a message that you've got the book. I'm close to the 30k connection limit so don't
want to miss your connection request!)
instagram.com/DJLilSingh — for a glimpse into my hobbies (DJing and being Drake's #1 Stan)
twitter.com/NipunFSingh — for tweets on careers and tech startups

Kevin Huo
linkedin.com/in/Kevin-Huo
instagram.com/Kwhuo
CHAPTER 1
4 Resume Principles to Live by for Data Scientists

Before you kick off the job hunt, get your resume in order. Time and effort spent here can pay rich dividends later on in the process. No one is going to grant you an interview if your resume doesn't scream success and competence. So here are four principles your resume should live by, along with miscellaneous hacks to level up. We even include our actual resumes from our senior years of college to show you how, in real life, we practiced what we preach to land our dream jobs at Facebook.

Principle #1: The Sole Purpose of Your Resume is to Land an Interview

No resume results in an immediate job offer; that isn't its role. What your resume must do is convince its recipient to take a closer look at you. During the interview process, your data science chops and people skills will carry you toward an offer. Your resume merely opens the door to the interview process. In practice, that means keeping your resume short! One page if you have under a decade of experience, and two pages if more. Save whatever else you want to say for your in-person interview, when you will be given ample time to get into the weeds and impress the interviewer with your breadth of knowledge and experience.

Since the main aim of your resume is to land an interview, the best way to do that is by highlighting a few of your best achievements. It's crucial that these highlights are as easy and as obvious as possible to find so that the person reading your resume decides to grant you an interview. Recruiters are busy people who often have only 10 seconds or less to review your resume and decide whether to give you an interview. Keeping the resume short and removing fluff is key to making sure your highlights shine through in the short timespan when a recruiter is evaluating your resume.
One way to shorten the resume and keep it focused on the highlights is by omitting non-relevant
jobs. Google doesn't care that you were a lifeguard three summers ago. The exception to this rule is
if you have zero job experience. If that's the case, do include the job on your resume to prove that
you've held a position that required a modicum of "adulting.”
Another way to make sure your resume lands you an interview is by tailoring the resume to the specific job and company. You want the recruiter reading your resume to think you are a perfect fit for the position you're gunning for. For example, say you worked as a social media assistant part-time three years ago promoting your aunt's restaurant. This experience isn't relevant for most data science roles and can be left off your resume. But if you're applying to Facebook and are interested in ads, it's worth including. If you're applying to be a data analyst on a marketing analytics team, it can help to leave in the social media marketing job.
Another resume customization example: when I (Nick) applied to government contractors with my
foreign-sounding legal name (Nipun Singh), I put "U.S. Citizen" at the top of my resume. This way, the
recruiter knew I had the proper background to work on sensitive projects, and could be eligible later
for a top secret security clearance.

Principle #2: Build Your Resume to Impress the Recruiter


The person you most need to impress with your resume is a nontechnical recruiter. The senior data
scientist or hiring manager at the company you want to work for is NOT your target audience — they
will have a chance to review your experience in-depth during your on-site interview. As such, spell
out technical acronyms if they aren't obvious. Give a little background; don't just assume the
recruiter will understand what you did and why it's impressive. For example, a research project called "Continuous Deep Q-Learning with Model-Based Acceleration" doesn't make sense to most
people. But a "Flappy Bird Bot Using Machine Learning" is more memorable and intriguing to the
average nontechnical recruiter.
Do you know what else recruiters love (along with hiring managers and execs)?
Numbers. BIG numbers!
Don't be afraid to report usage or view counts for your projects. Talking about user metrics shows
that you drove a project to completion, and got it in front of real people who were impacted by your
work. Even if the project is technically straightforward, real user or view counts go a long way in helping a recruiter understand that you made something of value.
For example, in college, I (Nick) made RapStock.io. The website didn't have a sleek UI, and the
underlying code wasn't complex, but at its peak, it had 2,000 Monthly Active Users. This experience
opened many doors and gave me a great story to tell because recruiters realized I'd actually shipped
something of value before.
An even better way to impress the recruiter is if you can quantify your impact in business terms:
write out the specific dollars earned, or dollars saved, due to your work. Most recruiters would prefer to read about the analysis you did that led to $20,000 in savings for a business rather than that little side project where you solved P vs. NP.

Why? Because it's hard to explain the P vs. NP problem and why it matters on a resume, unless you describe it with secondary results like "Won a Nobel Prize." But $20,000 cash is $20,000 cash; no
explanation needed. And because most data science job seekers are applying to businesses, talking
about real value you've generated in the past gives businesses confidence that you'll do the same at
their company. The truth is that, ultimately, what the recruiter is looking for isn't necessarily an
expert data scientist per se, but someone who will move the business and product forward — who
just so happens to do it using data science!

Principle #3: Only Include Things That Make You Look Good
Your resume should make you shine, so don't be humble or play it cool. If you deserve credit for
something, put it on your resume. But never lie or overstate the truth. It’s easy for a recruiter or
other company employee to chat with you about projects and quickly determine if you did it all or if
it was a three-person group project you're passing off as your own. And technical interviewers love
to ask probing questions about projects, so don't overstate your contributions.
Another way to look good is to not volunteer negative information. Sounds obvious enough, but I've seen people make this mistake often. For example, you don't have to list your GPA! Only write down your GPA if it helps. A 3.2 at MIT might be okay to list, but a 3.2 at a local lesser-known college might not be worth listing if you are applying to Google, especially if Google doesn't usually recruit at your college. Why? Because Google, being Google, might expect you to be at the top of your class with a 4.0 coming from a non-target university. As a result, a 3.2 might look bad.

Avoid Neutral Information


A mistake more common than volunteering negative information is adding neutral details to your
resume. Drowning out your accomplishments with irrelevant information detracts from your resume
in much the same way as volunteering negative information. Remember: the average recruiter will
only spend around 10 seconds skimming your resume, so you can't afford for them to be distracted
and miss your strongest points.
One big source of neutral information is the summary or objective section at the top. Your aim must
be to look exceptional, not typical or neutral. How boring and undifferentiated is: "hard-working,
results-oriented analytics professional looking for a Data Science job starting fall 2022." Worse, this
section is usually right at the top of a resume, taking up valuable real estate. Get rid of it completely!

You Probably Don't Need a Skills or Technologies Section


Another section of the resume packed with neutral details is the skills and technologies section.
Traditional resume advice says to jampack keywords in here. We disagree. You can eliminate this
section entirely or shorten it to two lines max (if you do choose to include it). There are several
reasons why we advocate shortening or removing the skills and technologies section.
First off, we need to address why traditional resume advice advocates for including this section. The reasoning is that to please the applicant tracking system (ATS), which has an algorithm that flags to a recruiter that your application is relevant to the job description you applied for, you need to stuff your resume with keywords. We don't agree with this advice. As we detail in Chapter 3, applying online is an ineffective strategy, so pleasing the ATS isn't paramount. Also, anything you list on your
resume is fair game for the interviewer to ask you about. Filling this section with tools you aren’t
familiar with to please the ATS algorithm can easily backfire come interview time.


Another reason to get rid of the skills section: remember "Show and Tell" in grade school? Well, it's
still way better to show rather than just to tell! Include the technologies inline with your past work
experience or portfolio projects. Listing the tech stack this way contextualizes the work you did and
shows an interviewer what you're able to achieve with different tools. Plus, in explaining your
projects and past work experiences, you'll have enough keywords covered to appease the ATS which
traditional career advice is overly focused on.
The last reason to ditch a big skills and technologies section is that you are expected to learn new
languages and frameworks quickly at most companies. The specific tools you already know are
helpful but not crucial to landing a job. You are expected to be an expert at data science in general,
not in specific tools. Plus, at large tech companies like Facebook and Google, the tools and systems
are often proprietary. Thus, the specific tools you know matter less than what you've actually accomplished with those tools in the past.

Mistakes College Students and New Grads Often Make


Listing too many volunteer experiences and club involvements from high school and college is a
frequent mistake we see college students and new grads make on their resume. They still think it's
like college applications, where listing your involvement in varsity soccer, piano lessons, and the
National Honor Society means something. It's great that you are a civically engaged, respectable
human involved with your community, but competitive tech companies and Wall Street firms are
selecting you for your ability to turn data into insight, and not much else. Attending ACM
meetings or going to the data science club is practically worthless as far as resume material goes.
Unless your involvement was significant, don't list it.
Another source of potentially irrelevant details: things from a long time ago, like your SAT score, or
which high school you attended. One caveat that traditional career services folks won't tell you:
leave in details from high school if they are exceptional.
Got more than a 2350 (or 1450) on your SAT? Leave it in!
Did well in USAMO, ISEF, or Putnam? Leave it in!
Are you an NCAA athlete, or did you attend college on a full-ride merit scholarship? Leave it in!
Same goes if you attended a prestigious high school like Phillips Exeter, Harker, or Thomas Jefferson
High School for Science & Technology (go Colonials!).
We've found that at elite tech companies and Wall Street firms, the interviewers went to these same
schools and won these same competitions. It may be okay to keep these on your resume even if they're from a long while ago, provided they don't take up too much space.
And if none of this applies to you, don't worry —there's so much more to you and your resume than
the brand names you've listed.

Principle #4: Break Formatting and Convention to Your Favor


We believe traditional resume advice is too rigid, with a one-size-fits-all approach. If breaking
conventional writing rules makes your resume more readable, then do it!
Typically, bold font is reserved for section headings. But feel free to bold any words or phrases that
make your resume easier to understand so the interviewer can decide — in 10 seconds or less! — to
interview you.


For example, back in college, I (Nick) made the size of the companies I worked at bigger than other
text. This way, while scanning my resume, a recruiter could quickly see and say, "Microsoft Intern.
Check. Google Intern. Check. Okay, let's interview this kid." I also bolded the user metrics for
RapStock.io on my resume. I wanted to quickly call out this information because it was sandwiched
between additional details about my projects.
Another resume convention you can break to your favor is section order. There is no hard and fast
rule on ordering your education, work experience, projects, and skills sections. Just remember: in
English, we scan from top to bottom. So list your best things at the top of the resume.
Keeping what's most important up top is an important piece of advice to remember, since it may conflict with advice from many college career advisors who suggest listing your university at the top. For example, if you currently attend a small unknown university, but interned at Uber this summer, don't be afraid to list your work experience with Uber at the top, and move education to the bottom. Went to MIT but have no relevant industry internships yet? Then it's fine to lead with your education, and not the work experience section.
Another resume rule on ordering you can break: items within a section don't have to be in chronological order! For example, I had a friend who interned at Google one summer, then interned part-time at a small local startup later that fall. It's okay to keep Google at the top of the resume, ahead of the small local startup, even though the startup experience was the more recent work experience. List first what makes you look the best.
Another convention you can safely break is keeping standard margins and spacing. Change margins
to your favor, to give yourself more space and breathing room. You can also use different font sizes
and spacing to emphasize other parts of your resume. Just don't use margin changes to double the
content on your one-page resume. If you do, you're likely drowning out your best content, which is a
big no-no(as mentioned in Principle #3).

Oh, and speaking of breaking the resume rules: you can ignore any of the tips I've listed earlier in this chapter if it helps you tell a better story on your resume. Earlier, I mentioned that listing irrelevant jobs, like being a waiter, won't help you land a data science gig. But go ahead and list it if, for example, you were a waiter who then built a project that analyzed the data behind restaurant food waste. In the same way, there's nothing wrong with listing the waiter position if you're applying to DoorDash or Uber Eats. If listing something helps tell the story of you, leave it in.
So don't listen to folks who tell you that linking to your SoundCloud from your resume is
unprofessional. Suppose you've made a hackathon project around music, or are applying to a company like Spotify. In that case, it's perfectly fine to list the SoundCloud link since it shows a recruiter that you followed your passions and created projects to further your interests. And by the way, if your mixtape is fire, please email us the link to [email protected] and
we'll give it a listen.

Miscellaneous Resume Hacks


Make Your Name and Email Prominent
Self-explanatory. Near your name, add a professional email address too. Also, do not use Yahoo mail
or Hotmail addresses. Silicon Valley tech folks are especially judgy about this.

Never Include Your Mailing Address


Companies are biased toward hiring local candidates because they don't need to pay relocation fees.


And recruiters are compensated based on the number of candidates they can close. So, put yourself
in a recruiter's chair. Let's say you're a Silicon Valley-based company recruiter with two identically-
skilled candidates, but one lives in the Bay Area and the other lives in NYC. Which of the two are you
more likely to close? The NYC candidate, who needs to decide to move to SF and uproot her family before accepting your offer, or the local person who can take your offer and start next week?
Don't List Your Phone Number Unless It's Local
Because of the spam robocall epidemic, anyone who calls you will email you first to ask for your
phone number and set up a time. So, there's no need to list it. Remember: you have 10 seconds to
rivet someone reading your resume. Don't waste a second of their time by presenting nonessential
information. Plus, if your phone number is international, it'll hurt even more, as often there's a bias
to hire local candidates. And yes, hiring managers and recruiters do notice when your number is
from a far-flung area code.

Include Your GitHub


Your GitHub link doesn't need to be super prominent. It's okay if your GitHub is a bit messy. Merely
having a GitHub listed is a sign that you have done work out in the open and that you're aware of
version control. And remember: since the first interview gatekeeper is a nontechnical recruiter, he or
she most likely won't be able to tell a messy GitHub from an okay one anyway.

Hyperlink to Outside Work Where Possible


Include links to your data analysis blogs or hackathon projects whenever possible. Linking to the
outside world validates to the recruiter that you made something and published it to the real world.
Even if the recruiter never clicks on a link, just seeing the blue hyperlink will give them a sense that
there's something more there.

Save Your Resumes as PDFs with Good Names


PDFs maintain formatting better and lead to better viewing experiences on mobile.
And be sure to save your resumes as "First Name Last Name Company Name Resume."
When people name it "Resume.pdf," it implies you're applying to many jobs willy-nilly. You should
be customizing your resumes for specific companies and roles anyway, so saving them with the
company name included will help you be more organized.

Resume Hacks for IRL


The first hack for sharing a resume in real life is to print your resume on heavier paper. Doing this will
make the resume stand out, and it will feel more professional. Next, bring your resume to every
place where you'll be meeting hiring managers and recruiters: job fairs, coffee chats, conferences,
meetups, and, of course, onsite interviews.
You might be skeptical of the value of doing this, especially since your LinkedIn and personal website
might have all the same information, but here's the main reason: you don't want to have to pull
these up on your phone if someone is curious. In many contexts, it's just way easier to hand them a
piece of paper.
And there's a subtle reason for doing this too. Even if the person you are trying to network with
doesn't read your resume in the moment, they'll likely hang onto it and read it later while waiting for
their Uber or when they're stuck at the airport. It's a physical memento you gave them, and they'll
be more likely to remember you and respond to your email follow-ups later (covered in Chapter 3).


And lastly, since you're carrying your resume everywhere, get a folio to carry it. Crumpled resumes
look unprofessional. If you are in college or are a new grad, and went to a top school, get a leather
padfolio with the school logo front and center for the subtle flex.

Nick's Facebook Resume (Senior Year, Fall Semester)


Nick (Nipun) Singh
[email protected], nipunsingh.com, github.com/NipunSingh

Experience:_________________________________________________________________
Google/Nest Labs, Software Engineering Intern May-Aug 2016
On the Data Infrastructure team, built a monitoring & deployment tool for GCP Dataflow jobs in Python (Django)
Wrote Spark jobs in Scala to take Avro-formatted data from HDFS and publish it to Google's Pub/Sub service
Microsoft, Software Engineering Intern May-Aug 2015
Reduced latency from 45 seconds to 80 milliseconds for a new monitoring dashboard for the payments team
Did the above by developing an efficient ASP.NET Web API in C# which leveraged caching and pre-processing of payment data queried from Cosmos via Scope (Microsoft internal versions of Hadoop File System and Hive)
CCRi, Data Science Intern Jun-Aug 2014
Worked on an NLP algorithm for a contract with the Office of Naval Research
Improved F1 measure of the algorithm by 70% compared to the original geo-location algorithm used by Northrop Grumman, by designing a new algorithm in Scala which used the Stanford NLP package to geo-locate events in news
Projects:____________________________________________________________________
Founder, RapStock.io Jan-May 2015
Grew site to 2,000 Monthly Active Users, and received 150,000 page views
Developed using Python (Django), D3.js, jQuery, Bootstrap, PostgreSQL, and deployed to Heroku
Game similar to fantasy football, but players bet on the real-world popularity and commercial success of rappers; the rappers' performance is based on metrics scraped from Spotify and Billboard
"Great to see that folks stick around" - Alexis Ohanian, Founder of Reddit, commenting on our retention metrics


PennApps Hackathon, Autocaption.com Sep 2015
Won 'Best Use of Amazon Web Services' award at the 2,000-person hackathon
Created a website which takes in a photo and then automatically captions it
Used IBM Watson API and image recognition for tagging. Scraped 7,000 jokes and quotes with Python. Used and tuned AWS CloudSearch to search the database of captions. Backend built with Django.
Education:__________________________________________________________________
B.S. Systems & Information Engineering, Minor in Computer Science and Applied Math
Rodman Scholar (Top 5% of Engineering Class)
Thomas Jefferson High School for Science and Technology
SAT: 2360 (800 Math, 800 Writing, 760 Reading)
Technologies
Languages: Python (Django), Java, Scala, R
Other: PostgreSQL, Spark, Google Cloud Platform, AWS, Heroku
Entrepreneurial Activities
Entrepreneurship Group at UVA Jan 2014-Aug 2015
Relations officer for a club with 130 active members: planned speaker events and pitch competitions
Jan 2012-Aug 2013
Performed at 25 events as a Hip-Hop & Bollywood DJ, providing music, lighting, & MC services

Kevin's Facebook Resume (Senior Year, Fall Semester)

Kevin Huo
EDUCATION
University of Pennsylvania - Philadelphia, PA Graduating: May 2017
The Wharton School: BS in Economics with concentrations in Statistics & Finance
School of Engineering and Applied Sciences: BSE in Computer Science
GPA: 3.65/4.00
Honors: Dean's List (2013-2014), PennApps Health Hack Award (2014)


Statistics Coursework: Modern Data Mining, Statistical Inference, Stochastic Process, Probability,
Applied Probability Modeling

Computer Science Coursework: Algorithms, Data Structures, Automata and Complexity,


Databases, Intro to Data Science, Software Engineering, Machine Learning
Finance Coursework: Investment Management, Derivatives, Monetary Economics
Thomas Jefferson High School for Science and Technology - Alexandria, VA Graduated: June 2013
GPA: 4.44/4.00 (Weighted), SAT: 2350 (Math-800, Writing-800, Reading-750)
Honors: National Merit Finalist, National AP Scholar, American Invitational Math Exam (AIME)
Qualifier
ACTIVITIES
Wharton Undergraduate Data Analytics Club (Team Leader) January 2016-January 2017
Participated in speaker series, tech talks, and data hackathons as a general member
Project leader for consulting group performing analyses
WORK EXPERIENCE
Facebook (Data Science Intern) June 2016-August 2016
Analyzed fraud within Atlas by looking at edge cases among existing systems
Built cost view of fraud for advertisers and presented recommendations to relevant teams

Technologies used: SQL, R, Python

Bloomberg LP (Software Engineering Intern) May 2015-August 2015
Developed a contributor analysis tool for the Interest Rate Volatility Team
Constructed various statistical metrics to gauge contributors

Built UI component using JavaScript and in-house technologies

Wrote various Python scripts to monitor metrics


Technologies used: Python, JavaScript
Zetta Mobile (Software Engineering Intern) June-August 2014
Wrote Python scripts to compile recorded data from logs of mobile advertisements
Used R to look for useful trends and patterns in the compiled data
Built scripts to automate the data analysis of click-through rate
Scripts are being used in beta-testing for future automated data analyses for the company to use

Technologies used: Python

Computer Science Teaching Assistant January 2014-December 2016

Held weekly recitation and office hours, and was responsible for grading homework, tests, and quizzes


Discrete Math (Spring & Fall 2014), Data Structures & Algorithms (Spring/Fall 2015 & 2016)

LANGUAGES/FRAMEWORKS
Proficient: Python, R, SQL, Java; Familiar: JavaScript, HTML/CSS; Basic: OCaml, Hadoop, Linux



CHAPTER 2
How to Make Kick-Ass Portfolio Projects

Unanimously, data science hiring managers have told us that not having portfolio projects
was a big red flag on a candidate's application. This holds true especially for college students
or people new to the industry, who have more to prove. From mentoring many data
scientists, we've found that having kick-ass portfolio projects was one of the best ways to stand out in the job hunt. And from our own experience, we know that creating portfolio projects is a great way to apply classroom knowledge to real-world problems in order to get some practical experience under your belt. Whichever way you slice it, creating portfolio projects is a smart move. In this chapter, you'll learn 5 tips to level up your data science and machine learning projects so that recruiters and hiring managers are jumping at the chance to interview you. We teach you how to create, position, and market your data science project. When done right, these projects will give you something engaging to discuss during your behavioral interviews. Plus, they'll help make sure your cold emails get answered
(Chapter 3).

The High-Level Philosophy


As we discussed to death in Chapter 1, the recruiter is the person we need to impress because they
are the interview gatekeeper. A recruiter won't dive deep into your Jupyter Notebook, look at line
#94, see the clever model you chose, and then offer you an interview. That's not how recruiting (or
people!) work.


The majority of recruiters just read the project description for 10 seconds in the cold email you send
them or when reviewing your resume. Maybe — if you're lucky — they click a link to look at a
graphic or demo of the project. At this point, usually in under 30 seconds, they think to themselves,
"This is neat and relevant to the job description at hand," and decide to give you an interview.
Thus, we're optimizing our data science portfolio projects to impress the decision-maker in this
process — the busy recruiter. We're optimizing for projects that are easily explainable via email.
We're optimizing for ideas that are "tweetable": ones whose essence can be conveyed in 140
characters or less. By having this focus from day one when you kick off the portfolio project, you will skyrocket your chances of ending up with a "kick-ass" portfolio project that gets recruiters hooked.
Don’t worry if you think that focusing on the recruiter will cheapen your portfolio project's technical
merits. Believe us: the technical hiring manager and senior data scientists interviewing you will also
appreciate how neatly packaged and easily understandable your project is. And following our tips
won't stop you from making the project technically impressive; an interesting and understandable
project does not need to come at the expense of demonstrating strong technical data science skills.

Tip #1: Pick a Project Idea That Makes for an Interesting Story
Recruiters and hiring managers are human. Human beings love to hear and think in terms of stories.
You can read the book that's quickly become a Silicon Valley favorite, Sapiens: A Brief History of
Humankind, by Yuval Harari, to understand how fundamental storytelling is to our success as a
species. In the book, Harari argues that it's through the shared stories we tell each other that Homo
sapiens are able to cooperate on a global scale. We are evolutionarily hardwired to listen, remember,
and tell stories. So do yourself a favor and pick ideas to work on which help you tell a powerful story.
A powerful story comes from making sure there is a buildup, then some conflict, and a nice,
satisfying resolution to said conflict. To apply the elements of a story to a portfolio project, make
sure your work has some introductory exploratory data analysis that builds up context around what
you are making and why. Then pose a hypothesis, which is akin to a conflict. Finally, share the verdict
of your posed hypothesis to resolve the conflict you posed earlier. By structuring your portfolio like a
story, it'll be easier to talk more eloquently about your project in an interview. Plus, the interviewer
is hardwired to be more interested — and therefore more likely to remember you and your project
— when you tell it in a format that we're hardwired to love.
So, how do you discover projects that will translate into captivating stories?
Looking at trending stories in the news is a great starting point because they are popular topics that
are easy to storytell around. For example, in the fall of 2020, the biggest news stories were the
COVID-19 pandemic and the 2020 U.S. presidential election. Interesting projects on these topics
could be to look at vaccination rates by zip code for other diseases, and see how they correlate to
demographic factors in order to understand healthcare inequities and complications with vaccine
rollout plans. For the 2020 U.S. presidential election, an interesting project would be to see what
demographic factors correlate highest for a county flipping from Donald Trump in 2016 to Joe Biden
in 2020, and then predicting which counties are the most flippable for future elections.
If you ever get stuck on these newsworthy topics, data journalism teams at major outlets like the
New York Times and FiveThirtyEight have already made a whole host of visualizations related to
these issues. These can serve as inspiration or as a jumping-off point for more granular analysis.


Another easy source of ideas with good story potential is to think about problems you personally
face. You'll have a great story to tell where you're positioned as the hero problem-solver if you can
convey how annoying a problem was to you and that you needed to solve it for yourself and other
sufferers. I've seen friends at hackathons tackle projects on mental health (something they
personally struggled with), resulting in a very powerful and moving narrative to accompany the
technical demo.
Tip #2: Pick a Project Idea That Visualizes Well
A picture is worth a thousand words. And a GIF is worth a thousand pictures. So go with a portfolio
project idea that visualizes well to stand out to recruiters. Ideally, make a cool GIF, image, or
interactive tool that summarizes your results.
I (Nick) saw the power of a catchy visualization firsthand at the last company I worked at, SafeGraph,
when we launched a new geospatial dataset. When we just wrote a blog post and put it on
SafeGraph's Twitter, we wouldn't get much engagement. But when we included a GIF of the
new dataset visualized, we'd get way more attention.
This phenomenon wasn't just isolated to social media; the power of catchy photos and GIFs even
extended to cold email. When we'd send sales emails with a GIF embedded at the top, we got much
higher engagement than when we'd send boring emails that only contained text to announce a
product. These marketing lessons apply to your data science portfolio projects as well, as you should
be emailing your work to hiring managers and recruiters (covered in detail in Chapter 3). You might
be thinking. "Why are we wasting time on this and not focusing on the complicated technical skills
that a portfolio project should demo?"
We want to remind you: your ability to convey results succinctly and accurately is a very real skill.
Explaining your work and presenting it well is a great signal to companies, because real-world data
science means convincing stakeholders and decision makers to go down a certain path. Fancy models are great, but only if you can easily explain their results and business impact to higher-ups. A compelling visual is one of the easiest ways to accomplish that goal in the business world.
Demonstrating this ability through a portfolio project gives any interviewer confidence you'll be able
to excel at this aspect of data science when actually on the job.

Tip #3: Make Your Project About Your Passion


Making your portfolio project about your passion is a cheat code for a whole slew of reasons.
Passion is contagious. If you're having fun talking about your passion project, chances are those
same good vibes will catch on with the interviewer. Plus, when you work on something you're
naturally passionate about, it becomes much easier and more comfortable for you to talk about the
work during an otherwise nerve-wracking interview. This effortless communication will help you
come across as a more articulate communicator, a highly desirable attribute for any hire. Making
your project about your passion, and then communicating this passion, also leads to a halo effect,
where you come across as passionate for related things, like the field of data science, the job at
hand, and the company. This passion and enthusiasm halo effect is especially crucial to create for
more junior data science candidates. Early-career data scientists require more hand holding and
resources invested by a company compared to experienced hires, From talking to hiring managers,
we found that they chose to invest in more junior candidates when the candidate displayed high
amounts of enthusiasm and passion. This enthusiasm and passion is a great signal that the junior
candidate will be motivated to learn quickly and close the skill gap fast. Thus, by signaling passion by


working on passion projects, you help make companies want to invest in you over a more senior
candidate who might have more technical skills but lacks the same interest in the field.
What does this advice mean in practice? If you love basketball, then use datasets from the NBA in
your portfolio projects. Passionate about music? Classify songs into genres based on their lyrics.
Binge Netflix shows? Take the IMDb movies dataset and make your own movie recommender algorithm.
For example, I (Nick) — a passionate hip hop music fan and DJ that's always on the hunt for
upcoming artists and new music — made RapStock.io, a platform to bet on upcoming rappers. When
talking about the project to recruiters, it was effortless for me to come across as passionate about
data science and pricing algorithms because the underlying passion for hip hop music was shining
through.
Another benefit of working on a project related to your passion: it's less of a chore to get the damn
project over the finish line when work becomes play. And getting the project done is paramount to
your success, as we later detail in tip #5.

Tip #4: Work with Interesting Datasets


Don’t work with datasets that people have worked with in their school work, such as the classic Iris
Plant dataset or the Kaggle passenger survival classification project using the Titanic dataset —
they're overdone. Worse, working on these datasets likely means you are not working on something
you are passionate about, which goes against the advice in the previous tip. I stand corrected,
though, if you're a weirdo whose true passion is classifying flowers into their respective species
based on petal and sepal lengths.
Another reason to not work with these standard datasets is because the recruiter and hiring
manager will have seen this project done multiple times already. This situation occurs frequently if
you are a college student or in a bootcamp, and are trying to pitch a required class project as your
portfolio project. You can imagine how lame that comes across to recruiters during university
recruiting events to see the same project over and over again from everyone who took the same
class.
Kevin heard a recruiter complain about exactly this after she saw one too many Convolutional Neural Nets trained with TensorFlow for handwriting recognition based on the MNIST Digit Recognition Dataset. So, stay away from this classic dataset, along with avoiding stock ticker data (unless you are gunning for Wall Street jobs) and the Twitter firehose data (unless you truly have a fresh take on analyzing tweets).
To drive home how you can seek out interesting datasets that tell a good story and relate to your
passion, let's work with a concrete example. Suppose you love space and dream of working as a data
scientist at NASA. How would you find exciting datasets to help you break into your dream job?
Our first step would be to go on Kaggle and find NASA's Asteroid Classification challenge. Or we can
analyze the Kepler Space Observatory Exoplanet Search dataset. If we wanted to start more simply
and only knew Excel, we could look at the CSV of all 357 astronauts and their backgrounds and make
a few cool graphics about what their most common undergrad majors in college were. So much data
is out there just one Google Search away — you have no excuse to be working on something boring!
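To show how little code such a starter analysis takes, here is a minimal sketch in Python with pandas. The file name astronauts.csv and the "Undergraduate Major" column are assumptions based on the Kaggle NASA astronauts dataset, so adjust both to match whatever your copy of the data actually uses.

# Minimal sketch: most common undergrad majors among astronauts.
# Assumes a local "astronauts.csv" with an "Undergraduate Major" column
# (as in the Kaggle NASA astronauts dataset); rename to match your file.
import pandas as pd
import matplotlib.pyplot as plt

astronauts = pd.read_csv("astronauts.csv")

# Count how often each undergraduate major appears and keep the top 10.
top_majors = astronauts["Undergraduate Major"].value_counts().head(10)

# A quick horizontal bar chart is plenty for a portfolio graphic.
top_majors.plot(kind="barh", title="Most Common Astronaut Undergrad Majors")
plt.tight_layout()
plt.show()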
The upside of using a website like Kaggle to get datasets is that it's well-formatted, clean, and often
there are starter notebooks exploring the data. The downside is that others may also be looking at it,
hurting your project's uniqueness. There are also some lost learning opportunities, since collecting
and cleaning data is a big part of a data scientist's work. However, if you find something you really
love on Kaggle, it's not a big problem. Go for it, and maybe later find a different, complementary
dataset to add another dimension to your project.
One way to tackle interesting datasets that are unique is to scrape the data yourself. Packages like
Beautiful Soup and Scrapy in Python can help, or rvest for R users. Plus it's an excellent way to
practice your programming skills and also show how scrappy you are. And since collecting and
cleaning data is such a large part of a real data scientist's workflow, scraping your own dataset and
cleaning it up shows a hiring manager you're familiar with the whole life cycle of a data science
project.
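If you go the scraping route, a bare-bones script gets you surprisingly far. The sketch below uses requests plus Beautiful Soup; the URL and the table selectors are placeholders rather than a real site, so inspect the actual page's HTML (and its robots.txt and terms of service) before pointing this at anything real.

# Bare-bones scraping sketch with requests + Beautiful Soup.
# The URL and CSS selectors are placeholders; inspect the real page's HTML
# and adjust them before running this against an actual site.
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com/stats")
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

rows = []
for row in soup.select("table tr"):
    # Pull the text out of each cell in the row.
    cells = [cell.get_text(strip=True) for cell in row.find_all("td")]
    if cells:
        rows.append(cells)

print(rows[:5])  # eyeball the first few rows before cleaning further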
Tip #5: Done > Perfect. Prove You Are Done
As long as your work is approximately correct, the actual technical details don't matter as much for
getting an interview. Again, as mentioned above, a recruiter will not dig into your project and notice
that you didn't remove some outliers from the data. However, a recruiter can quickly determine how
complete a project is! So make sure you go the extra mile in "wrapping up a project." See if you can
"productionize" the project.
Turn the data science analysis into a product. For example, if your project was training a classifier to
predict age from a picture of a face, go the extra step and stand up a web app that allows anyone to
upload a photo and predict their own age. As part two to the project, use a neural net to transform
the person's face to a different age, similar to FaceApp. Putting in this extra work, and then cold-
emailing the project to hiring managers, could be your ticket into companies like Snapchat,
Instagram, and TikTok.
If your project was less productizable and more exploratory, see if you can make an interactive
visualization that helps you tell a story and share your results. For example, let's say you did an
exploratory data analysis on the relationship between median neighborhood income and quality of
school district. To wrap this project up, try to make and host an interactive map visualization so that
folks can explore and visualize the data for themselves. I like D3.js for interactive charts and
Leaflet.js for interactive maps. rCharts is also pretty cool for R users. By creating a visualization, and then
sending this completed interactive map to hiring managers at Zillow or Opendoor, you'll be able to
stand out from other candidates.
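If you would rather stay in Python than write JavaScript, the folium package (a Python wrapper around Leaflet) gives you a shareable interactive map in a handful of lines. The sketch below is only an illustration of the idea: the coordinates, incomes, and school ratings are made up, and you would swap in your own neighborhood-level data.

# Minimal interactive map sketch using folium (a Python wrapper around Leaflet).
# All coordinates and numbers below are made up for illustration only.
import folium

# Center the map on a city (here, roughly San Francisco).
m = folium.Map(location=[37.77, -122.42], zoom_start=12)

# (lat, lon, median income, school rating); replace with your real data.
neighborhoods = [
    (37.76, -122.43, 95000, 7.5),
    (37.79, -122.41, 140000, 9.0),
    (37.73, -122.45, 68000, 5.5),
]

for lat, lon, income, rating in neighborhoods:
    folium.CircleMarker(
        location=[lat, lon],
        radius=rating,  # size the marker by school rating
        popup=f"Median income: ${income:,} | School rating: {rating}",
    ).add_to(m)

m.save("income_vs_schools.html")  # a static HTML file you can host anywhere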
Lastly, your portfolio project isn't done until it's public, so make sure you publicly share your code on
GitHub. You can also use Google Colab to host and share your interactive data analysis notebook.
Even if no one sees the code or runs the notebook (which is likely!), just having a link to it sends a
signal that you are proud enough of your work to publish it openly. It also shows that you actually
did what you said you did and didn't just fabricate something to pad the resume.

Tip #6: Demonstrate Business Value


The best portfolio projects are able to demonstrate business value. Try to make it concrete, not
theoretical. This advice is crucial for PhDs breaking into industry, especially if the academic is trying
to break into smaller companies or startups. When you are applying to businesses, talk in business
terms. Show how your technical skills can drive business value. Try to make sure your project has a
crisp business recommendation or fascinating takeaway.

If you can't point to exact metrics like dollars earned or time saved by creating the project, you can
instead put down usage numbers as a proxy for the amount of value you created for people. Plus,
mentioning view counts or downloads or active users helps demonstrate to a business that you drove
a project to completion. It’s okay to skip out on demonstrating business value IF you work with
interesting enough data and can tell a good story. An example project we find interesting, creative
and fun, but technically simple and not obviously a driver of real business value: A Highly Scientific
Analysis
of Chinese Restaurant Names. Send that project to recruiters at Yelp or DoorDash and watch the
interviews come pouring in.
Double Down on One Portfolio Project to Breeze Through Your
Behavioral Interview
If the tips above seem like a lot of work, that's because they are. But the good news is you don't
need to do this for every project. Spending all your energy on having a single killer project is worth it
as long as you end up with a lot to show for it, like beautiful graphics, a working demo, and some
usage statistics or metrics on business value created.
The other reason it's okay to focus your time and energy on creating a single kick-ass project is
because this can help carry you through a behavioral interview. Typical behavioral interview
questions start with "Tell me about a time..." or "Tell me about a project where...." By coming back
to the same project, you don't have to waste valuable time setting context or background. Instead
you can dive straight into answering the behavioral interview question. Plus, focusing your time and
energy into one project means you'll be able to tackle more challenging problems and apply more
advanced techniques. This is good because common behavioral interview questions include, "What
was the hardest data-related problem you tackled?"
An additional benefit to going deep into one project is that you're much more likely to create real
business value or gather users and views for your project. Trying to market and promote multiple
side projects is a recipe for disaster, because showing traction for a single project is hard enough for
most people. Sticking to one project makes it much more likely you'll discover some method or angle
to generate value and publicity.
Another reason why a well-crafted portfolio project allows you to breeze through your behavioral
interview is because one common question is, "Why this company?" or "Why this industry?"
Hopefully, you worked on something you are passionate about that is related to the company or
industry you are interviewing with. Now your project is able to "show, not tell" your interest.
One example of using a project to "show, not tell" was when I (Nick) applied to Facebook's Growth
team. They asked a common question: "Why Facebook's Growth team?"
I was able to tell the story of creating consumer products and being a DJ, which led me to create a
music-tech startup called RapStock.io. From RapStock.io I found my love of growing consumer
demand through engineering. This project sparked my interest in combining Software Engineering,
Data & Experimentation Design, and creating consumer products. This trajectory mapped exactly to
what Facebook's Growth team did all day, so I'd like to think I gave the perfect answer backed up by
an authentic story.
You might be thinking, "Nick, you got lucky working on a consumer tech startup with a focus on
growth that lined up beautifully with what Facebook's Growth team does." But dear reader, here's a
little secret:
 When I talked to fintech companies, I told them about the stock market and commodity pricing
aspect of my game.
 When I talked to data companies, I went deeper into the algorithm that assigned prices to
rappers from Spotify data.
 When I talked to marketing and ad-tech companies, I talked about how this tech project piqued
my interest in the world of advertising after a failed Google ads campaign.
 When I interviewed with startups, I talked about how I, too, was a startup founder in the past,
and wanted to move fast and ship things quickly rather than suffer through big company
bureaucracy.
With just one project, I was able to "show, not tell" my direct interest in a variety of companies and
types of work, while showcasing my technical expertise and personality at the same time. A kick-ass
portfolio project, along with the more detailed behavioral interview tips we present in Chapter 4, will
allow you to do the same.

Cold Email Your Way to Your
Dream Job in Data
CHAPTER 3

You've crafted the perfect resume and made a kick-ass portfolio project, which means it's time
to apply to an open data science position. Eagerly, you go to an online job portal and submit
your resume, and maybe even a cover letter. And then it's crickets. Not even an automated
rejection letter.
If you've applied online and then been effectively ghosted, you're not alone. Kevin and I
have been in the same situation plenty of times. We are all too familiar with the black hole
effect of online job applications, where it almost feels like you're tossing your resume into
the void.
So how do you reliably land interviews, especially if you have no connections or referrals?
Two words: Cold. Emails.

While in college, Snapchat and Cloudflare interviewed me (Nick) when I had no connections or
referrals at those companies. I got these interviews by writing an email, out of the blue, to the
company's recruiters. This process is known as cold emailing (in contrast to getting a warm
introduction to a recruiter). Even my previous job at data startup SafeGraph is the result of a cold
email that I sent to the CEO. We firmly believe this tactic can be a game changer on the data science
job hunt.

We don't want to over-promise, though. The best written cold email won't help if you're pursuing
jobs you aren't a good fit for, like a new grad applying to be a VP of Data Science. Plus, you need to
have a
strong resume (Chapter 1) and strong portfolio projects (Chapter 2). But if you've got your ducks in a
row, yet struggle to land the first interview, this chapter will be a game changer.
Who are we even cold emailing?
Before we talk about the content of the cold email, let's cover who we're reaching out to in the first
place.
At smaller companies with fewer than 50 employees, emailing the CEO or CTO works very well. At mid-
range companies (between 50 and 250 people), see if there is a technical recruiter to email;
otherwise just a normal recruiter should do. Another option is emailing the hiring manager for the
team you want to join.
For larger companies, finding the right person can be trickier. If you are looking for internships or are
a new grad, many of the larger companies (1,000+ employees) have a person titled "University
Recruiter" or "Campus Recruiter." Reaching out to these recruiters is how I (Nick) had the most luck
when cold emailing in college.
At very large companies like FAANG, there should also be dedicated recruiters working only with data
scientists. To find these recruiters, go to the company's LinkedIn page and hit "employees." Then,
filter the search results by title and search for "Data Recruiter." When doing this at Google, I found
six relevant data science recruiters to reach out to.
Another option is to just filter the search by "recruiter." You’ll get hundreds of results that you can
sift through manually. Doing so at Google uncovered an "ML recruiter," "PhD (Data Science)
Recruiter, " and a "Lead Recruiter, Databases & Data Analytics (GCP)," all in just a Few minutes.
Another good source of people to email at a company is alumni from your school who work there.
Even if they work in a non-data science role, they may be able to refer you or know the right person
to connect you with. To find these people, search your university on LinkedIn and click "alumni."
From there, you can filter the alumni profiles based on what companies they work at or what titles
they hold. I resort to this tactic if my first few cold emails to hiring managers and recruiters go
unanswered.

How do we find their email address?


Now that we know who we want to email, how do we find their email address?
You can use a free tool like Clearbit Connect or Hunter.io to look this up. If you can't find the person
you want with an email-lookup tool, you can always guess. At smaller companies, or when dealing
with the founders, a good guess is firstname@company.com. For example, Jeff Bezos's real
email is jeff@amazon.com.
At mid-sized companies, firstname.lastname@company.com may work well, along with using
their first initial and last name with no spaces. When you're in doubt, you can use Hunter.io or
Clearbit Connect to see the format that others at the company use, and then you can make an
educated guess. Then put your best guess email address in the "to" section of the email, and cc a
few more email guesses in the hopes that one of the emails will be on target.
To shortcut all this work, you can also use MassApply (massapply.com), which has hundreds of
technical jobs with the recruiter contact information already available inside the platform. It allows
you to send customized cold emails in a single click to recruiters, as well as to track your job
applications. We're huge fans, but are a bit biased, because Nick's brother founded MassApply!


Can't Find an Email?


Every address you guessed bounced? Send them a LinkedIn InMail or Twitter DM instead!
While email should still be your first choice (busy people like recruiters and CEOs are in their email
inbox all day — not on LinkedIn or Twitter), you have nothing to lose when reaching out on other
platforms! The tips we soon cover on writing effective cold emails apply to other types of cold
messages as well.

8 Tips for Writing Effective Cold Emails


Now that we know who to email and how to get their email, what do we actually say in the email?
Here are 8 quick tips for writing effective emails.

Tip #1: Keep the Email Short


Recruiters are busy people, as we covered in depth in Chapter 1 on resumes. Just like with your
resume, they don't have a lot of time to read your email. You've got 10 seconds to impress them
with your email so that they respond to it rather than ignore it. So, keep your email short.
The data backs email brevity. HubSpot analyzed 40 million emails and found the ideal length of a
cold sales email is between 50 and 125 words to maximize response rates. I've personally had the
best luck at around 100 words. It's all about maintaining a high signal-to-noise ratio. You don't need
to include phrases like "I hope you are doing well today!" or "I hope this email finds you well." At
best, it's extraneous and at worst, insincere.

Tip #2: Mention an Accomplishment or Two


Our cold email is just like a sales email. In one paragraph, you are trying to sell yourself to the
recruiter as someone worthy of an interview. This is sales — don't be shy.
Highlight a relevant accomplishment or internship experience that makes you worthy of a response.
Name-drop that hackathon you won. Hyperlink to your favorite project or an app on the app store
with a few thousand downloads and mention that usage number. If you went to an impressive
engineering school, lean into that.
However, you don't need to link to too many things or copy-paste the entire resume. That would end
up breaking Principle #1: Keep It Short. Instead, attach your resume to the initial email so the
recruiter can get more background if needed.
Tip #3: Add Urgency and Establish a Timeline
My favorite tip: if you already have a return internship offer with a different company or a competing
job offer extended to you, mention that. It puts pressure on a recruiter to respond promptly and
might even fast-track you to an onsite interview. This tactic works especially well if it's an offer with a
well-known company.
Even if the deadline is very far from now, so there is no true urgency, name-dropping the other
company is helpful as social proof. If other companies desire you, then a recruiter is more likely to
feel you are valuable and have #fomo. This leads them to respond to you. You don't even need
the offer in hand to make this tactic work! Just having an onsite interview scheduled with a top
company helps other companies realize you've got something worthwhile and that there is a specific
timeline to adhere to.

Former Microsoft Intern With Upcoming Deadline Interested in Uber ATG

Nick Singh <[email protected]> 9:18 AM (0 minutes ago)


to recruiter

Hello Recruiter Name

I have an upcoming onsite-interview with Microsoft's Azure ML team next month, but wanted to
also interview with Uber because self-driving cars is where I believe Computer Vision will help
improve the world the most in the next decade.

Helping the world through CV became my passion after seeing the impact of the last project I
made, which used CV to find and categorize skin diseases.

From reading the ATG engineering blogs, I know Uber is the best place for a passionate computer
vision engineer to make an impact, and am eager to start the interview process before I go too far
down the process with Microsoft.

I've attached my resume.


Thanks,
Nick Singh

One warning: be careful not to make it seem like the company you are talking to is the backup
option. To do this, make sure you convey enthusiasm for the company and mission. The email above is an
example of that.

Tip #4: Relate Personally to the Recruiter or Company


Yes, this is a cold email, but you don't have to be so cold! The person at the other end of the email is
still a human, and you can make a real-life connection even if you haven't met before. It's well worth
it to do your research. For example, see if you have a mutual connection with the recruiter. Use
LinkedIn to see if you have any commonalities like education or cities you've both lived in. Even two
minutes of sleuthing on the internet looking for a commonality can pay huge dividends when it
comes to response rates.

Tip #5: Have a Specific Ask


Be up front with what you want. A vague email hoping to "set up a time to chat" or "learn more
about the interview process" is too meek and indirect. The recruiter knows that between your
friends, Google Search, Quora, and Glassdoor, you can find any information you need about a
company and the interview process. They undoubtedly know you are angling for a job or internship
but are too shy to ask directly.
So why not be bold — after all, fortune favors the bold — and always include a specific ask:
"I'd like to interview for a Data Science Internship for Summer 2021."
"I'd like to start the interview process for the Senior Data Scientist position at your company."

Tip #6: Have a Strong Email Subject Line


An email is only read if it's opened. Without a strong subject line to lure the recipient into actually
opening the email, the email is wasted. Thus, it's worth spending time crafting a strong email subject
line. To make the subject line click-worthy, it's key to include your most noteworthy and relevant
details. It's okay if the subject line is keyword driven and a "big flex," as the teens say these days.
Borrow from BuzzFeed clickbait titles — they actually work! I wouldn't go so far as to say "21 Weird
Facts That'll Leave You DYING to Hire This Data Scientist," but you get the gist.
What I (Nick) used, for example: "Former Google & Microsoft Intern Interested in FT @ X"
This subject line works because I lead immediately with my background, which is click-worthy since
Google and Microsoft are well-known companies, and I have my specific ask (for full-time software
jobs) included in the subject line. Some other subject line examples that are short and to the point if
you can't rely on internship experience at name-brand companies:
“Computer Vision Ph.D. Interested In Waymo”
“Princeton Math Major Interested in Quant @ Goldman Sachs”
“Kaggle Champion Interested in Airbnb DS”
“UMich Junior & Past GE Intern Seeking Ford Data Science FT”
If I (Nick) found a recruiter from my alma mater (UVA), I'd be sure to include that in the subject line
to show that it's personalized. For reaching out to UVA alumni, I'd throw in a "Wahoowa" (similarly,
a "Go Bears" or "Roll Tide" if you went to Berkeley or Alabama, respectively). Including the name of
the recruiter should also increase the click-through rate.
Example: "Dan I FinTech Hackathon Winner And Wahoo Interested in Robinhood"
Another hack: including "Re:" in the subject line to make it look like they've already engaged in
conversation with you.

Tip #7: Follow Up 3 Times


A perfectly written email sent only once may not work. You should follow up at least three times. You
can reply directly to the thread, so that the context from the first email still remains there.
Don't worry about feeling too pushy — it's standard in sales to reach out 3+ times. I know firsthand
not to give up too early: some of the cold emails that turned into interviews only got responses after
the third email. Send the first follow-up after 3—4 days, and send the second follow-up 4—5 days
later. Don't think putting in a 2-week delay will make you come across as more polite.
A free Gmail plugin like Boomerang, which will flag when an email hasn't been responded to in some
time, can help to keep yourself accountable. There is also automatic email scheduling within
MassApply so that you can follow up three times.
If after a few emails you don't get a response, reach out to another recruiter at the same company.
It's okay to reach out to multiple people at a company that you want to work for. Trust me, it's not a
weird thing to do. In enterprise sales lingo, reaching out to multiple people at a target company is
called being "multi-threaded into an account," and it's a time-tested tactic.

Tip #8: Send the Email at the Right Time


We've all been guilty of getting an email, reading it, and waiting till later to respond to it. And then
"later" never comes. That's why Principle #7 — following up three times — works so damn well. But
sending an email out at the right time can save you from having to bump up emails. To maximize
your reply rate, send the email when you think the reader is most likely to be free and in the mood
to respond. That means no weekend emails. No emails on holidays or days people typically might
take a long weekend. Figure out the time zone for the recruiter, and be sure not to send it after
business hours.

I had the best luck emailing Silicon Valley recruiters at around 11 A.M. or 2 P.M. P.S.T. The psychology
behind this is we've all felt ourselves counting down the minutes to lunch, aimlessly refreshing our
email and Slack to pass the time. That's a great time to catch someone. Same with the after-lunch
lull. The best days I found to send emails were Tuesday through Thursday. I avoided Mondays since
that's the day many people have 1:1s or team meetings or have work they are catching up on from
the weekend. On Fridays, many people might be on PTO, or even if they are in office, have some
other kind of event like happy hour in the afternoon (or they've already mentally checked out before
the weekend).

3 Successful Cold Email Examples


Here are some real cold emails I've sent in the job hunt. The text is exactly the same as what I sent,
but I just re-created it to protect the recipient's name and email address.
These emails aren't perfect by any means, but generally follow the eight tips I've laid out above. Just
remember: even sending a cold email that's bad puts you in the top decile of job seekers. Most
people will never make an effort to personally write an email to someone they don't know, and then
have the tenacity to follow up a few times. It's precisely why cold email works so well!

The 4-Sequence Email Drip to Periscope Data


Intro Email:

Ex-Google & Microsoft Intern Interested in Working FT at Periscope Data

Nick Singh <[email protected]> Tue, Sep 8, 2020, 2:43 PM

to recruiter
Hi X,

Found your email on Hacker news. I’m a former Software Engineering Intern @ Google's Nest
Labs and Microsoft who will be graduating from college May '17.

I'm interested in working full time at Periscope Data because of my interest in data engineering
(spent the summer on the Data Infrastructure team @ Nest) and my interest in turning data into
insights (built dashboards to do just that the past two summers).

How can I start the interview process?


Best,
Nick Singh


My Two Follow-Ups

Nick Singh 9:32 AM (1 minute ago)


to recruiter

Just wanted to follow up with you about full-time roles at Periscope Data. I believe my interest in
data engineering, along with past experience building dashboards and visualization tools, makes
me a good fit.

***

Nick Singh 9:33 AM (1 minute ago)


to recruiter
Hi X, Wanted to circle back on this. What do next steps look like?

The Hail Mary:


I send this email when I have on-site interviews near the target company planned, or when I have
offer deadlines approaching. This email is often sent weeks after the initial outreach. It works
because it adds an element of urgency to the recruiter, and it gives social proof that other companies
have vetted me enough to bring me on-site.

Nick Singh <[email protected]> Tue, Sep 8, 2020, 2:47 PM


to recruiter

Hi X,

Just wanted to follow up with you regarding opportunities with Periscope Data. I will be in
the bay area doing interviews with Facebook and Uber next week. Would love a chance to
do a phone interview with Periscope Data this week to assess technical fit. If we are a
good match, I’d be happy to swing by the office the following week for technical
interviews while I am already in town.

Thanks,
Nick Singh


More Examples of Real Cold Emails I’ve Sent


Cold Email to Airbnb in 2016

Former Google Intern from UVA Interested in Airbnb

Nick Singh <[email protected]> 9:42 AM (0 minutes ago)

To recruiter,
Hello X,
we met briefly at the UVA in SF mixer this past summer.

I just wanted to reach out to you about new grad software engineering position @ Airbnb.
My friend Y, interned at Airbnb on the infrastructure team and really love their experience.
This past summer, I was on the data infrastructure team at Google's Nest Labs. From
talking to Y, I think I can be good fit for similar team at Airbnb. Let me know what the
next steps are.
Thanks
Nick Singh
Cold Email to Reddit in 2015

Former Microsoft Intern & Avid Redditor Interested


in SWE Internship

Nick Singh <[email protected]>


To recruiter,

Hello X,
I saw your post on hacker news and wanted to reach out regarding why I am a good fit to be a
software engineer intern at reddit for summer 2016.
I interned at Microsoft this past summer on the payments team where I helped the team turn data
into insights to diagnose payment issues faster.
In my free time (when I'm not on reddit) I built RapStock.io which I grew to 2000 users. 1400 out
of the 2000 users came from reddit when we went viral so I have a soft spot for the community and
product.
Let me know what next steps I should take.


Cold Email to SafeGraph in 2018 (how I got my last job!)


Below is the screenshot of the exact email I sent in summer of 2018 to SafeGraph CEO, Auren
Hoffman. I decided to email the CEO directly, in addition to my AngelList application, due to my poor
track record of hearing back from online job portals. Within 24 hours of sending this email, I had an
interview booked with Auren and ended up working at SafeGraph for close to two years. That's the
power of cold email!
By utilizing these cold email tips and taking inspiration from these cold-email examples, in
conjunction with a strong resume and kick-ass portfolio projects, you're well on your way to landing
more data science interviews. Now comes the next challenge of the data science job hunt: acing the
behavioral interview.

Nipun Singh <[email protected]> Sun, Jul 15, 2018, 2:29 AM


to auren, Auren.hoffman

Auren,
I'm super interested in the COS role at SafeGraph. I applied on AngelList but figured I'd also shoot you an email.

I'm a good fit for the role because


• I'm currently a Software Engineer on Facebook's Growth team. I'm data-driven to the max - all day either
coding or cutting data to understand the impact of what I coded and the A/B tests I ran.
• I'm a hustler. I ran a startup in college, which I grew to 2,000 MAU. I ran a DJ business in high school which
taught me how to be a people person and also sell. I helped run the Venture Capital club (Virginia Venture
Fund) and Entrepreneurship Group (HackCville) at my college.
• I studied Systems Engineering in college, which is super similar to Industrial Engineering / OR (your major).
I also studied Computer Science. I've taken classes on ML, Computer Vision, Stochastic Processes, Databases.
I'd be able to understand the technical details of SafeGraph and the space we operate in very quickly.
I've attached my resume. I'd love to call or meet up in person to talk more about why I'm excited about SafeGraph.

Thanks,

Nipun Singh
www.nipunsingh.com

Ace the Behavioral Interview
CHAPTER 4

Now that you've finally built a kick-ass resume, compiled an impressive project portfolio and
intrigued the HR department at your dream company enough to call you in for an interview
based on your strategically written emails, you're ready to ace the technical data science
interview questions and land the job. But there's one more piece to the puzzle whose
importance is usually underestimated: the behavioral interview. While it's true that 90% of
the reason candidates pass interviews for the most coveted big tech and finance jobs is
because of their technical skills — their ability to code on the spot, write SQL queries, and
answer conceptual statistics questions — neglecting the other 10%, which stems from the
behavioral interview, can be a huge mistake.

Behavioral Interviews Aren't Fluffy B.S.


You may not agree that the behavioral questions are important. You might think this behavioral
interview stuff is fluffy bullshit, and that you can simply wing it. Sure, in some companies, as long as
you aren't an asshole in the interview and have the technical skills, you'll be hired. But some
companies take this very seriously. Amazon, for example, interviews every single candidate on the
company leadership principles like "customer obsession" and "invent and simplify" in their bar raiser
interviews. Uber also includes a bar-raiser round which focuses on behavioral questions about
culture fit, previous work experience, and past projects. If you want to work there, you've got to take
the behavioral interviews seriously.
Even small companies have their unique twist on this. Consider this: my (Nick's) former employer,
SafeGraph, displayed the company values on a poster hung in every room in the building. Even the
bathroom. While you're pissing, the company values are in your face. No joke. It’s that paramount.
So imagine if you came to an interview with me and didn't share any stories that exhibited
SafeGraph's company values. You'd have a pretty piss-poor chance of passing the interview!

When Do Behavioral Interviews Happen?


You might be wondering, “I had four interviews for a position, and not one of them was called a
behavioral interview.”
The behavioral interview is an integral part of any interview and consists of questions that help the
interviewer assess candidates based on actual experiences in prior jobs or situations. Even the
"friendly chat" time at the start of the interview can essentially be a behavioral interview.
So while you might not have an explicit calendar invite for "Behavioral Interview," don't be fooled:
behavioral interviews occur all the damn time. That casual, icebreaker of a question, “So, tell me
about yourself…" — that's an interview question! For every job, and at practically every round, there
will be a behavioral component, whether it's explicit or not. Behavioral interview questions can
happen:
 With a recruiter before getting to technical rounds. In which case you might not even get to the
technical interview rounds…
 During your technical interviews, where the first 5-10 minutes are usually carved out for a casual
chat about your past projects and your interest in the company.
 During lunch, to understand how you behave outside of the interview setting.
 At the end of the on-site interview: they know you can do the work, but are you someone they
want to personally work with? You'll meet with your future boss, and maybe even their boss,
where they'll both try to sell you on the company, but also see if you'd be a good culture fit.
The reality is, you are constantly being assessed! That’s why, on the basis of frequency alone,
preparing for and practicing answers to these questions is well worth the effort.

Ace the Behavioral Interview to Beat the Odds


Acing the behavioral interview can be the X-factor — the thing that separates you out from the
horde of other applicants. You don't want to be lying in bed at night, wide awake, thinking, "Damn it,
I forgot to tell them about that time I caught that data analysis mistake and saved the company
$50,000!" A little prep work for your interview can mean the difference between a strikeout and a
home run.
Focusing on behavioral interviews is especially important if you're new to the data science game.
When a company makes an investment in junior talent, they are looking at 3-6 months of training
before that junior data person becomes truly productive. You probably won't give the best answers
on technical questions, and there will always be more senior candidates in the pipeline, but the
coachability, enthusiasm, and eagerness to learn that you show in the behavioral interview could be
what convinces a company to take a chance and invest in you.

3 Things Behavioral Interviews Test For


You know that behavioral interviews are important, that they happen all the time, and that the
stakes are high especially for junior talent. Now you might be wondering, "What are interviewers
even looking for?" Behavioral questions have to do with...well...behavior. There are three basic kinds
of things an employer tests for:
 Soft skills: How well do you communicate? So much of effective data science is dealing with
stakeholders — would you be able to articulate your proposals to them, or sell your ideas
convincingly enough to get buy-in from them? How well do you work with others? Data science
is a team sport, after all! How do you deal with setbacks and failures? Do you get defensive, or
exhibit a growth mindset?
 Position fit: How interested are you in the job and team you're gunning for? What motivates
you about the position — only the paycheck or passion as well?
 Culture fit: How well do you fit the team and company’s culture? Can you get behind the
company's mission and values? Basically the corporate version of a "vibe check"!
Essentially, while technical interviews are about whether you can do the job, behavioral interviews
are about whether you want to do the job, and if you are someone others will want to work with.
Fortunately, you can have the charisma of Sheldon Cooper from Big Bang Theory and still pass
behavioral interviews — if you prepare for the most common behavioral questions asked in data
science interviews.

Tell Me About Yourself: The #1 Behavioral Interview Question


"Tell me about yourself' may seem like a simple icebreaker to ease tension and get the interview
rolling, but it's actually the #1 most asked behavioral interview question! If you are not properly
prepared with your answer, you can stumble through it blithely, telling your life history and all sorts
of irrelevant details that are not what they want to know about you. First impressions matter, and a
well-thought-out answer can impress the hell out of your interviewer and put you in the running
from the get-go.
So, how do you prepare an awesome answer to this seemingly innocuous question?
 Limit your answer to a minute or two; don't ramble! As such, start your story at a strategically
relevant point (which is often college for most early-career folks).
 Relate your story to the position and company at hand. See if you can weave your pitch with key
terms from the job description and company values. Speak their language!
 Mention a big accomplishment or two; even though they've seen your resume, don't let them
forget about your biggest selling point!
 Rehearse. You know this question will be asked at the start of every interview.
Your answer should include these three key points:
1) Who you are
2) How you came to be where you are today (sprinkle in your achievements here)
3) What you're into/looking for now (hint: it's basically this role + this company)
To make this more concrete, here's the "about me" pitch we authors used on the job hunt.

Kevin's Wall Street "About Me" Pitch


Hi, I'm Kevin, currently a data scientist at Facebook. I graduated from Penn in 2017, studying
computer science, statistics, and finance. At Facebook I focused on analytics within the groups
team, making sure Facebook Groups is free of spam and hate speech. Before Facebook, I briefly
interned at a hedge fund, working on looking at alternative data sets, like clickstream data and
satellite imagery, to analyze stocks. Having worked in both big tech and Wall Street, I've come
to realize I'm more passionate about applying data science in financial markets because of the
fast-paced nature and high stakes environment. I was drawn to your fund in particular due to
the small team, high autonomy, and chance to be part of a more greenfield data science effort.

Nick's Google Nest Data Infrastructure Internship "About Me" Pitch


Hi! I'm Nick, and I'm currently a 3rd year student at the University of Virginia! I love the
intersection between software and data, which is why I'm studying Systems Engineering and
Computer Science at UVA. It's also why two summers ago, I interned as a Data Scientist at
a defense contractor, and last summer I interned on the Payments team at Microsoft doing back-
end work.
I'm super excited to potentially work on the Data Infrastructure team at Nest Labs since it's the
perfect blend of my past data and SWE experience. Plus, Nest's intelligent home automation
products rely on great data and machine learning, and I want to work at a company where data
is at the forefront. Lastly, I love how you all are a smaller, faster-paced division within Google.
Having made a startup in the past, which I growth hacked to 2,000 users in just a few months, I
love the "move fast" attitude of smaller companies. I think that Nest being an autonomous
company within Google strikes the perfect blend between startup and big tech company, and
it's why I'm so excited by this team and company.
Why did you choose Data Science?
Here's another question you might be asked, which is closely related to your personal pitch: "Why
did you choose Data Science?" Likely your answer to "tell me about yourself” contains some element
of how you got into the field, but you may be asked to harp on this point more, especially if you're
an industry switcher, or come from an untraditional background.
If your path isn't the most straightforward, don't be nervous — capitalize on this opportunity to
show you are a go-getter who decided to make a career change, a fast learner who has accomplished
so much in a short time, and that your passion for the field is genuine! This is also a great
opportunity to talk about how your skills from prior jobs and industry experience naturally led you
into data science. Remember, data science is much more than modeling — even if you weren't
throwing XGBoost at random datasets in your last job, there must have been some relevant data
science-adjacent skills you acquired. And deep down, internalize that your newness to the field isn't
a weakness, but a strength — you've probably got extra subject matter expertise and a fresh
perspective!

Tell Me About a Time: The #1 Most Common Pattern for Questions


Once you have your opening pitch prepared, along with the story of how you got into the field, it's
time to focus on the other questions most likely to be asked.

The #1 question after “tell me about yourself” is: "Tell me about a situation where" something
happened. Note that this question can be phrased in various ways: "Give me an example of when you
X" and "Describe a situation when you Y." This is your time to share war stories from past jobs. If you
lack work experience, this is the time for your independent portfolio projects to shine.

Most Common "Tell Me About a Time" Questions


Some of the most commonly asked "tell me about a time" questions are:
Tell me about a time...
 you dealt with a setback — how did you handle it?
 you had to deal with a particularly difficult co-worker — how did you manage it?
 you made a decision that wasn't popular — how did you go about implementing it?
 you accomplished something in your career that made you very proud — why was that moment
meaningful to you?
 you missed a big deadline — how did you handle it?
"Tell Me About a Time" for Data Scientists
While the above popular questions are fair game, you might also be asked a twist on these questions
so that they're better geared towards data scientists, analysts, and machine learning engineers.
Below are some more data-driven behavioral interview questions.
Tell me about a time…
 when data helped drive a business decision.
 where the results of your analysis were much different than what you would have expected. Why
was that? What did you do?
 when you had to make a decision BUT the data you needed wasn’t available.
 you had an interesting hypothesis — how did you validate it?
 when you disagreed with a PM or engineer.
Now that we've got the laundry list of situational questions out of the way, how do you answer these
questions on the spot?

A superSTAR Answer
The trick to answering the behavioral questions we listed earlier on the spot is... well...to NOT
answer them on the spot! A lot of preparation needs to go into this so you can give effortless off-the-
cuff answers come interview time. Your first step in preparing flawless answers is to prepare stories
that address the questions we mentioned earlier. But don't prepare factual answers.
Prepare stories.
“But I’m no storyteller, I’m a data scientist! How am I supposed to “weave a fascinating tale”
about something as mundane as work history?”
Luckily, there is a simple formula you can use as a framework to structure your story. It's easy to
remember, too. Just remember that a great story will make you a STAR, so you have to use the STAR
formula:
 Situation — Describe a specific challenge (problem or opportunity) you or your team, your
company, or your customers encountered.
 Task — Describe the goal you needed to accomplish (the project or task).
 Action — Describe your role in the project or task using first person (not what your team did, but
what you did).
 Result — Describe what happened as a result of your actions. What did you learn or accomplish?
Keep in mind that not all outcomes need be positive; if things didn't go your way, explain what
lesson you learned (for example, "I learned about the importance of transparency and clear
communication"). Showing that you can handle failure and learn from it is a great trait!
Write your stories out using the STAR formula. Where possible, weave into your narrative key
phrases from the job description and the company culture or values page, so that you hit the
position fit and culture fit elements of the interview.

Amazon Data Scientist Interview Example


For a concrete example of STAR, assume I (Nick) am interviewing to be a data scientist on the AWS
Product Analytics team. According to the job opening, the role entails influencing the long-term
roadmap of the AWS EC2 Product Team. The job description also mentions looking for someone with
a startup mentality, since "AWS is a high-growth, fast-moving division." Finally, their preferred
qualifications include “demonstrated ability to balance technical and business needs” and
"independently drive issues to resolution while communicating insights to nontechnical audiences."
Now that we've set up the role I'm interviewing for, imagine the Amazon bar-raiser hits me with the
question: "Tell me about a time you were not satisfied with the status quo."
My answer:
 Situation: I challenged the status quo back when I worked on Facebook's Growth Team,
specifically on the New User Experience division. Our main goal was to improve new user
retention rates. In 2018, there was a company-wide push for Facebook Stories based on the
success of Instagram stories and the fear of Snapchat gaining even more market share. The
status quo was to prioritize features that would promote the adoption of Facebook Stories, but I
had a strong hunch this wasn't good for new users.
 Task: My goal was to understand how new users used Facebook Stories, and whether the feature
helped or hurt new user retention rates.
 Action: For 3 weeks, I sliced and diced data to better understand whether Facebook Stories
helped or hurt new users. In the process, I found multiple bugs and user experience gaps related
to Stories for new users, which led to decreased retention rates for new users. I fixed the smaller
bugs, and presented the bigger data-driven insights into the user experience problems with the
wider Facebook Stories team as well as the New Person Experience team.
 Result: Fixing the bugs resulted in new user retention rates increasing by X%, and Y % more
usage in Facebook Stories. More importantly, by questioning the status quo that Facebook
Stories was good for everyone, I made the Facebook Stories team more conscious of gaps in the
product as it related to new users. This affected the Facebook Stories product roadmap, and led
them to prioritize user onboarding features for their next quarter.

This is an effective answer because it emphasizes how my data-driven work impacted the product
roadmap — essentially what this Amazon product analytics job is all about. It also demonstrates my
passion for new users, which jives with Amazon's company value of customer obsession.

Remember, though, a winning answer to a behavioral interview question is about more than just
words. Project the confidence of a college sophomore who thinks majoring in business means
they'll be a CEO one day. Embody BDE — big data energy. To dial in your delivery, practice telling your
stories out loud. Do this in front of a mirror; it'll force you to pay attention to your nonverbal skills,
which are also very important in an interview. Use a timer, and without rushing, ensure your answers
are under two minutes long.

How to Ace Project Walk-Through Questions


Rather than asking you about a situation, project walk-through questions let you talk about a project
in detail. These questions often have follow-ups where they ask for more details — and they may
even be a jumping-off point to ask more general technical questions.
In addition to checking your communication skills, like the more traditional behavioral questions,
these questions are also testing to see if you've actually done what you say you did. The bullet points
on your resume don't always tell the whole story — maybe the work is less (or more!) impressive
than you made it sound. In fact, with the length limitations on a resume's job description, there's
probably a LOT more to the story than the resume reveals.
In project walk-throughs, you might specifically be asked questions such as:
 How did you collect and clean the data? Did you run into any issues when interpreting
the data?
 How did you decide what models and techniques to use? What did you eventually try?
 How did you evaluate the success of your projects? Was there a baseline to compare
against? What metrics did you use to quantify the project's impact?
 Did you deploy the final solution? What challenges did you face launching your work?
 What tough technical problems did you face — and how did you overcome them?
 How did you work with stakeholders and teammates to ensure the project was
successful? If there were any conflicts, how did you resolve them?
 If you did the project again, what would you do differently?
If the above questions look familiar, that's because you can consider the project walk-through
questions as the inverse of "tell me about a time" questions. Said another way, given a project, the
interviewer asks a lot of the same "tell me about a time" questions, except the "time" is all about a
certain project. The same concepts for applying STAR still apply!

"Do you have any questions for us?"


Pretty much every interview includes a segment where you get to ask questions. "I don't really have
any" is NOT the right answer. So, what should you ask when the interviewer says, "Do you have any
questions for us?"
There is a right way to answer this! Don't waste this time asking random questions like how much
time you get off or how much the job pays! Traditional advice says, "This is the time to interview the
company." We disagree! Be strategic about what questions you ask. Have the mindset "until I have
the offer in hand, I need to keep showing why I'm a good fit" you aren't interviewing them, you're
selling yourself! As such, prepare at least three smart, interesting questions per interviewer, Don't

far better position to discuss compensation.
As we mention later in Chapter 10: Product Sense, this is the time to leverage the company and
product research you did. You'll gain much more by asking questions that convey your interest in the
company and what they do. From your readings and product analysis, surely you must be curious
about some internal detail, design decision, or what's coming next for some product. This point in
the interview is your opportunity not only to have your intellectual curiosity fulfilled, but to impress
them with your research and genuine interest in the company and its business.

Another idea is to check out details about your interviewer on LinkedIn. It's not uncommon to know
who you'll be interviewing with. Asking a personal question is a sure way to get the interviewer
talking about themself. And people love to do that! If you can tailor questions to their background or
projects they've worked on, great! If not, you can ask these sure-fire conversation starters:
How did you come into this role or company?
What's the most interesting project you've worked on?
What do you think is the most exciting opportunity for the company or product?
In your opinion, what are the top three challenges facing the business?
What do you think is the hardest part of this role?
How do you see the company values in action during your day-to-day work?
Going against the grain from traditional career advice, we think asking questions about the role isn't
the most beneficial use of this opportunity. Sure, you're not going to get into trouble for asking
about the growth trajectory for the role at hand, or what success looks like for the position. It's just
that you'll have ample time, and it's a better use of your time to ask these questions after you have
the job. While you're in the interview mode, again, it's important to either reinforce your interest in
the company, their mission and values, or at least have the interviewer talk about themself. We
believe discussing nuances about the role isn't the most productive step to take without an offer at
hand.

Post-interview Etiquette
Whew! Your interview is finally over!
No, it's not!
Send a follow-up thank you note via email a few hours after your interview to keep your name and
abilities fresh in their mind. Plus it shows them your interest in the position is deep and sincere.
Ideally, you'll mention a few of the specific things you connected with them over during the
interview in your email/note. This will help jog their memory as to which interviewee you were and
hopefully bring that connection to mind when they see your name again.

The Best Is Yet to Come!


Now that we've walked you through our process for landing more data science interviews and
passing the behavioral interview screen, you're almost ready for the meat of the book: acing the
technical data science interview.
Before we get to the 201 interview problems, we have a quick favor to ask from you. Yes, you. If
you're enjoying this book, share a photo of your copy of Ace The Data Science Interview on LinkedIn
and tag us (Nick Singh and Kevin Huo). Feel free to add a quick sentence or two on what's resonated

Ace the Data Science Interview


ACE THE DATA SCIENCE INTERVIEW | HUO & SINGH

with you so far. We'll both connect with you as well as like and comment on the post. You'll get more
LinkedIn profile views, followers, and brownie points from us this way.



Probability
CHAPTER 5

One of the most crucial skills a data scientist needs to have is the ability to think
probabilistically. Although probability is a broad field and ranges from theoretical concepts
such as measure theory to more practical applications involving various probability
distributions, a strong foundation in the core concepts of probability is essential.
In interviews, probability's foundational concepts are heavily tested, particularly conditional
probability and basic applications involving PDFs of various probability distributions. In the
finance industry, interview questions on probability, including expected values and betting
decisions, are especially common. More in-depth problems that build off of these
foundational probability topics are common in statistics interview problems, which we cover
in the next chapter. For now, we'll start with the basics of probability.

Basics
Conditional Probability
We are often interested in knowing the probability of an event A given that an event B has occurred.
For example, what is the probability of a patient having a particular disease, given that the patient
tested positive for the disease? This is known as the conditional probability of A given B and is often
found in the following form based on Bayes' rule:
P(A | B) = P(B | A) P(A) / P(B)
Under Bayes' rule, P(A) is known as the prior, P(B | A) as the likelihood, and P(A | B) as the posterior.

If this conditional probability is presented simply as P(A)—that is, if P(A | B) = P(A)—then A and B are
independent, since knowing about B tells us nothing about the probability of A having also occurred.


Similarly, it is possible for A and B to be conditionally independent given the occurrence of another
event C: P(A  B\C) = P(A\C)P(B\C)
The statement above says that, given that C has occurred, knowing that B has also occurred tells us
nothing about the probability of A having occurred.
If other information is available and you are asked to calculate a probability, you should always
consider using Bayes' rule. It is an incredibly common interview topic, so understanding its
underlying concepts and real-life applications will be extremely helpful. For example, in
medical testing for rare diseases, Bayes' rule is especially important, since it may be misleading to
simply diagnose someone as having a disease, even if the test for the disease is considered "very
accurate," without accounting for the disease's base rate (its prevalence in the population).
Bayes' rule also plays a crucial part in machine learning, where, frequently, the goal is to identify the
best conditional distribution for a variable given the data that is available. In an interview, hints will
often be given that you need to consider Bayes' rule. One such strong hint is an interviewer's
wording in directions to find the probability of some event having occurred "given that" another
event has already occurred.
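As a quick illustration, here is a minimal sketch in Python of the rare-disease example; the 1-in-1,000 prevalence and the 98%/1% test rates are hypothetical numbers chosen only for the example:

# Hypothetical numbers: prior disease prevalence, test sensitivity, false-positive rate.
prior = 0.001                 # P(disease)
p_pos_given_disease = 0.98    # P(positive | disease)
p_pos_given_healthy = 0.01    # P(positive | no disease)

# P(positive) via the law of total probability.
p_pos = p_pos_given_disease * prior + p_pos_given_healthy * (1 - prior)

# Bayes' rule: P(disease | positive) = P(positive | disease) * P(disease) / P(positive)
posterior = p_pos_given_disease * prior / p_pos
print(round(posterior, 4))    # roughly 0.089, despite the "very accurate" test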

Law of Total Probability


Assume we have several disjoint events B1, ..., Bn that together cover all possible outcomes (a
partition of the sample space); we can then break down the probability of an event A having
occurred thanks to the law of total probability, which is stated as follows:
P(A) = P(A | B1) P(B1) + … + P(A | Bn) P(Bn)
The equation above provides a handy way to think about partitioning events. If we want to model
the probability of an event A happening, it can be decomposed into the weighted sum of conditional
probabilities based on each possible scenario having occurred. When asked to assess a probability
involving a "tree of outcomes" upon which the probability depends, be sure to remember this
concept. One common example is the probability that a customer makes a purchase, conditional on
which customer segment that customer falls within.
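As a sketch of that customer-segment example (the segment sizes and conversion rates below are made up), the overall purchase probability is just the weighted sum of conditional probabilities:

# Hypothetical customer segments: P(segment) and P(purchase | segment).
segments = {
    "new":    {"p_segment": 0.50, "p_purchase": 0.02},
    "casual": {"p_segment": 0.35, "p_purchase": 0.05},
    "loyal":  {"p_segment": 0.15, "p_purchase": 0.20},
}

# Law of total probability: P(purchase) = sum over segments of P(purchase | B_i) * P(B_i).
p_purchase = sum(s["p_purchase"] * s["p_segment"] for s in segments.values())
print(p_purchase)  # 0.0575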

Counting
The concept of counting typically shows up in one form or another in most interviews. Some
questions may directly ask about counting (e.g., "How many ways can five people sit around a lunch
table?"), while others may ask a similar question, but as a probability (e.g., "What is the likelihood
that you draw four cards of the same suit?").
Two forms of counting elements are generally relevant. If the order of selection of the n items being
counted k at a time matters, then the method for counting possible permutations is employed:
n(n − 1)⋯(n − k + 1) = n! / (n − k)!
In contrast, if the order of selection does not matter, then the technique to count the possible number of
combinations is relevant:
(n choose k) = n! / (k!(n − k)!)


Knowing these concepts is necessary in order to assess various probabilities that involve counting
procedures. Therefore, remember to determine when the order of selection does versus does not matter.
For some real-life applications of both, consider making up passwords (where the order of characters
matters) versus choosing restaurants nearby on a map (where order does not matter, only the
options). Lastly, both permutations and combinations are frequently encountered in combinatorial
and graph theory-related questions.
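Both counts are easy to sanity-check with Python's standard library; a minimal sketch (the 5-person and 52-card examples are just illustrations):

import math

def permutations(n, k):
    # Ordered selections of k items from n: n! / (n - k)!
    return math.factorial(n) // math.factorial(n - k)

def combinations(n, k):
    # Unordered selections of k items from n: n! / (k! (n - k)!)
    return math.factorial(n) // (math.factorial(k) * math.factorial(n - k))

print(permutations(5, 5))   # 120 ways to seat five people in a row
print(combinations(52, 4))  # 270725 four-card hands from a standard deck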
Random Variables
Random variables are a core topic within probability, and interviewers generally verify that you
understand the principles underlying them and have a basic ability to manipulate them. While it is
not necessary to memorize all mechanics associated with them or specific use cases, knowing the
concepts and their applications is highly recommended.
A random variable is a quantity with an associated probability distribution. It can be either discrete
(i.e., have a countable range) or continuous (have an uncountable range). The probability distribution
associated with a discrete random variable is a probability mass function (PMF), and that associated
with a continuous random variable is a probability density function (PDF). Both can be represented
by a function of x, written f_X(x).
In the discrete case, X can take on particular values with particular probabilities, whereas, in the
continuous case, the probability of any single value of x is zero; instead, a probability mass per unit
length around x can be measured (imagine the small interval between x and x + δ).
Probabilities of both discrete and continuous random variables must be non-negative and must sum
(in the discrete case) or integrate (in the continuous case) to 1:
Discrete: ∑_{x ∈ X} f_X(x) = 1,  Continuous: ∫_{−∞}^{∞} f_X(x) dx = 1
The cumulative distribution function (CDF) is often used in practice rather than a variable's PMF or
PDF and is defined in both cases as F_X(x) = P(X ≤ x).
For a discrete random variable, the CDF is given by a sum: F_X(x) = ∑_{k ≤ x} p(k), whereas, for a
continuous random variable, the CDF is given by an integral:
F_X(x) = ∫_{−∞}^{x} p(y) dy

Thus, the CDF, which is non-negative and monotonically increasing, can be obtained by taking the
sums of PMFs for discrete random variables, and the integral of PDFs for continuous random
variables.
Knowing the basics of PDFs and CDFs is very useful for deriving properties of random variables, so
understanding them is important. Whenever asked about evaluating a random variable, it is
essential to identify both the appropriate PDF and CDF at hand.
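A tiny sketch for a fair six-sided die makes both definitions concrete (exact fractions are used just to show the PMF summing to 1):

from fractions import Fraction

# PMF of a fair six-sided die.
pmf = {k: Fraction(1, 6) for k in range(1, 7)}
assert sum(pmf.values()) == 1  # probabilities must sum to 1

# CDF: F(x) = P(X <= x), the running sum of the PMF.
def cdf(x):
    return sum(p for k, p in pmf.items() if k <= x)

print(cdf(4))  # 2/3 -- non-negative and monotonically increasing in x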

Joint, Marginal, and Conditional Probability Distributions


Random variables are often analyzed with respect to other random variables, giving rise to joint
PMFs for discrete random variables and joint PDFs for continuous random variables. In the
continuous case, for the random variables X and Y varying over a two-dimensional space, the
integration of the joint PDF yields the following:
∫_{−∞}^{∞} ∫_{−∞}^{∞} f_{X,Y}(x, y) dx dy = 1


This is useful, since it allows for the calculation of probabilities of events involving X and Y.
From a joint PDF, a marginal PDF can be derived. Here, we derive the marginal PDF for X by
integrating out the Y term:
f_X(x) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dy
Similarly, we can find a joint CDF, where F_{X,Y}(x, y) = P(X ≤ x, Y ≤ y) is equivalent to the following:
F_{X,Y}(x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} f_{X,Y}(u, v) dv du

It is also possible to condition PDFs and CDFs on other variables. For example, for random variables X
and Y, which are assumed to be jointly distributed, we have the following relationship:
f_X(x) = ∫_{−∞}^{∞} f_Y(y) f_{X|Y}(x | y) dy
where X is conditioned on Y. This is an extension of Bayes' rule and works in both the discrete and
continuous case, although in the former, summation replaces integration.
Generally, these topics are asked only in very technical rounds, although a basic understanding helps
with respect to general derivations of properties. When asked about more than one random
variable, make it a point to think in terms of joint distributions.
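In the discrete case these manipulations are just sums over a table; a minimal sketch with a small, made-up joint PMF of X and Y:

# Hypothetical joint PMF p(x, y) for discrete X and Y; the entries sum to 1.
joint = {
    (0, 0): 0.10, (0, 1): 0.20,
    (1, 0): 0.30, (1, 1): 0.40,
}

# Marginal PMF of X: sum the joint PMF over y.
marginal_x = {}
for (x, y), p in joint.items():
    marginal_x[x] = marginal_x.get(x, 0) + p

# Conditional PMF of Y given X = 1: joint divided by the marginal.
cond_y_given_x1 = {y: joint[(1, y)] / marginal_x[1] for y in (0, 1)}

print(marginal_x)        # {0: 0.3, 1: 0.7}
print(cond_y_given_x1)   # {0: 0.428..., 1: 0.571...}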

Probability Distributions
There are many probability distributions, and interviewers generally are not testing whether you have
memorized specific properties of each (although it is helpful to know the basics), but, rather, whether
you can properly apply them to specific situations. For example, a basic use case would be to
assess the probability that a certain event occurs under a particular distribution, in which case
you would directly utilize the distribution's PDF. Below are overviews of the distributions most
commonly included in interviews.

Discrete Probability Distributions


The binomial distribution gives the probability of k successes in n independent trials,
where each trial has probability p of success. Its PMF is
P(X = k) = (n choose k) p^k (1 − p)^(n−k)
and its mean and variance are μ = np and σ² = np(1 − p).
The most common applications for a binomial distribution are coin flips (the number of heads in n
flips), user signups, and any situation involving counting some number of successful events where
the outcome of each event is binary.
The Poisson distribution gives the probability of the number of events occurring within a particular
fixed interval where the known, constant rate of each event's occurrence is λ. The Poisson
distribution's PMF is
P(X = k) = λ^k e^(−λ) / k!
and its mean and variance are μ = λ and σ² = λ.


The most common applications for a Poisson distribution are in assessing counts over a continuous
interval, such as the number of visits to a website in a certain period of time or the number of defects
in a square foot of fabric. Thus, instead of coin flips with probability p of a head, as in the binomial
distribution, applications of the Poisson will involve a process X occurring at a rate λ.
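Both PMFs take only a couple of lines to implement; a minimal sketch (the parameter values below are arbitrary, and math.comb requires Python 3.8+):

import math

def binomial_pmf(k, n, p):
    # P(X = k) = C(n, k) * p^k * (1 - p)^(n - k)
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    # P(X = k) = lam^k * e^(-lam) / k!
    return lam**k * math.exp(-lam) / math.factorial(k)

print(binomial_pmf(3, 10, 0.5))  # P(3 heads in 10 fair coin flips) ~= 0.117
print(poisson_pmf(2, 4.0))       # P(2 events when the rate is 4)   ~= 0.147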

Continuous Probability Distributions


The uniform distribution assumes a constant probability of X falling between values on the
interval a to b. Its PDF is
f(x) = 1 / (b − a)
and its mean and variance are μ = (a + b) / 2 and σ² = (b − a)² / 12.
The most common applications for a uniform distribution are in sampling (random number
generation, for example) and hypothesis testing cases.
The exponential distribution gives the probability of the interval length between events of a Poisson
process having a rate parameter of λ. Its PDF is
f(x) = λ e^(−λx)
and its mean and variance are μ = 1/λ and σ² = 1/λ².
The most common applications for an exponential distribution are in wait times, such as the time
until a customer makes a purchase or the time until a credit default occurs. One of the
distribution's most useful properties, and one that makes for natural questions, is its
memorylessness.
The normal distribution distributes probability according to the well-known bell curve over a range
of X values. Given a particular mean μ and variance σ², its PDF is
f(x) = (1 / √(2πσ²)) exp(−(x − μ)² / (2σ²))
and its mean and variance are, by definition, μ and σ².
Many applications involve the normal distribution, largely due to (a) its natural fit to many real-life
occurrences, and (b) the Central Limit Theorem (CLT). Therefore, it is very important to remember
the normal distribution's PDF.
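A quick sketch of the three PDFs above (the parameter values are arbitrary), useful for checking a computed density by hand:

import math

def uniform_pdf(x, a, b):
    return 1.0 / (b - a) if a <= x <= b else 0.0

def exponential_pdf(x, lam):
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

def normal_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

print(uniform_pdf(0.3, 0.0, 1.0))   # 1.0 everywhere on [0, 1]
print(exponential_pdf(2.0, 0.5))    # density at x = 2 when the mean wait is 1/lam = 2
print(normal_pdf(0.0, 0.0, 1.0))    # ~= 0.3989 at the peak of the standard normal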

Markov Chains
A Markov chain is a process in which there is a finite set of states, and the probability of being in a
particular state depends only on the previous state. Stated another way, the Markov property is
such that, given the current state, the past and future states it will occupy are conditionally
independent.
The probability of transitioning from state i to state j at any given time is given by a transition matrix,
denoted by P:


P =
( P_11 … P_1n
  ⋮    ⋱    ⋮
  P_m1 … P_mn )


Various characterizations are used to describe states. A recurrent state is one whereby, if entering
that state, one will always transition back into that state eventually. In contrast, a transient state is
one in which, if entered, there is a positive probability that upon leaving, one will never enter that
state again.
A stationary distribution π for a Markov chain satisfies π = πP, where P is the transition matrix; that
is, π remains unchanged after any further transitions using P. Thus, π contains the long-run
proportions of the time that the process will spend in each particular state.
Usual questions asked on this topic involve setting up various problems as Markov chains and
answering basic properties concerning Markov chain behavior. For example, you might be asked to
model the states of users (new, active, or churned) for a product using a transition matrix and then
be asked questions about the chain's long-term behavior. It is generally a good idea to think of
Markov chains when multiple states are to be modeled (with transitions between them) or when
questioned concerning the long-term behavior of some system.
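For example, here is a minimal sketch of the new/active/churned user model mentioned above. The transition probabilities are invented purely for illustration; the stationary distribution is approximated by repeatedly applying the transition matrix (power iteration):

import numpy as np

# Hypothetical transition matrix P over states (new, active, churned);
# each row sums to 1 and gives P(next state | current state).
P = np.array([
    [0.0, 0.7, 0.3],   # new     -> active / churned
    [0.0, 0.8, 0.2],   # active  -> active / churned
    [0.0, 0.1, 0.9],   # churned -> active / churned
])

# Power iteration: apply pi = pi @ P until pi stops changing, so that pi = pi P.
pi = np.array([1.0, 0.0, 0.0])   # start everyone in the "new" state
for _ in range(1000):
    pi = pi @ P

print(pi.round(3))  # long-run proportion of time spent in each state, roughly [0, 0.333, 0.667]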

Probability Interview Questions


Easy
5.1. Google: Two teams play a series of games (best of 7 — whoever wins 4 games first) in which
each team has a 50% chance of winning any given round (no draws allowed). What is the
probability that the series goes to 7 games?
5.2. JP Morgan: Say you roll a die three times. What is the probability of getting two sixes in a row?
5.3. Uber: You roll three dice, one after another. What is the probability that you obtain three
numbers in a strictly increasing order?
5.4. Zenefits: Assume you have a deck of 100 cards with values ranging from 1 to 100, and that you
draw two cards at random without replacement. What is the probability that the number of
one card is precisely double that of the other?
5.5. JP Morgan: Imagine you are in a 3D space. From (0,0,0) to (3,3,3), how many paths are there if
you can move only up, right, and forward?
5.6. Amazon: One in a thousand people have a particular disease, and the test for the disease is
98% correct in testing for the disease. On the other hand, the test has a 1% error rate if the
person being tested does not have the disease. If someone tests positive, what are the odds
they have the disease?
5.7. Facebook: Assume two coins, one fair (having one side heads and one side tails) and the other
unfair (having both sides tails). You pick one at random, flip it five times, and observe that it
comes up as tails all five times. What is the probability that you are flipping the unfair coin?
5.8. Goldman Sachs: Players A and B are playing a game where they take turns flipping a biased
coin, with p probability of landing on heads (and winning). Player A starts the game, and then
the players pass the coin back and forth until one person flips heads and wins. What is the
probability that A wins?
5.9. Microsoft: Three friends in Seattle each told you it is rainy, and each person has a 1/3
probability of lying. What is the probability that Seattle is rainy, assuming that the likelihood of
rain on any given day is 0.25?
5.10. Bloomberg: You draw a circle and choose two chords at random. What is the probability that
those chords will intersect?


5.11. Morgan Stanley: You and your friend are playing a game. The two of you will continue to toss a
coin until the sequence HH or TH shows up. If HH shows up first, you win. If TH shows up first,
your friend wins. What is the probability of you winning?
5.12. JP Morgan: Say you are playing a game where you roll a 6-sided die up to two times and can
choose to stop following the first roll if you wish. You will receive a dollar amount equal to the
final amount rolled. How much are you willing to pay to play this game?
5.13. Facebook: Facebook has a content team that labels pieces of content on the platform as either
spam or not spam. 90% of them are diligent raters and will mark 20% of the content as spam
and 80% as non-spam. The remaining 10% are not diligent raters and will mark 0% of the
content as spam and 100% as non-spam. Assume the pieces of content are labeled
independently of one another, for every rater. Given that a rater has labeled four pieces of
content as non-spam, what is the probability that this rater is a diligent rater?
5.14. D.E. Shaw: A couple has two children. You discover that one of their children is a boy. What is
the probability that the second child is also a boy?
5.15. JP Morgan: A desk has eight drawers. There is a probability of 1/2 that someone placed a
letter in one of the desk's eight drawers and a probability of 1/2 that this person did not place
a letter in any of the desk's eight drawers. You open the first 7 drawers and find that they are
all empty. What is the probability that the 8th drawer has a letter in it?
5.16. Optiver: Two players are playing in a tennis match and are at deuce (that is, they will play
back and forth until one person has scored two more points than the other). The first player
has a 60% chance of winning every point, and the second player has a 40% chance of winning
every point. What is the probability that the first player wins the match?
5.17. Facebook: Say you have a deck of 50 cards made up of cards in 5 different colors, with 10 cards
of each color, numbered 1 through 10. What is the probability that two cards you pick at
random do not have the same color and are also not the same number?
5.18. SIG: Suppose you have ten fair dice. If you randomly throw these dice simultaneously, what is
the probability that the sum of all the top faces is divisible by 6?

Medium
5.19. Morgan Stanley: A and B play the following game: a number k from 1-6 is chosen, and A and
B will toss a die until the first person throws a die showing side k , after which that person is
awarded $100 and the game is over. How much is A willing to pay to play first in this game?
5.20. Airbnb: You are given an unfair coin having an unknown bias towards heads or tails. How can
you generate fair odds using this coin?
5.21. SIG: Suppose you are given a white cube that is broken into 3 x 3 x 3 = 27 pieces. However,
before the cube was broken, all 6 of its faces were painted green. You randomly pick a small
cube and see that 5 faces are white. What is the probability that the bottom face is also
white?
5.22. Goldman Sachs: Assume you take a stick of length 1 and you break it uniformly at random
into three parts. What is the probability that the three pieces can be used to form a triangle?
5.23. Lyft: What is the probability that, in a random sequence of H's and T's, HHT shows up before
HTT?


5.24. Uber: A fair coin is tossed twice, and you are asked to decide whether it is more likely that
two heads showed up given that either (a) at least one toss was heads, or (b) the second toss
was a head. Does your answer change if you are told that the coin is unfair?
5.25. Facebook: Three ants are sitting at the corners of an equilateral triangle. Each ant randomly
picks a direction and begins moving along an edge of the triangle. What is the probability that
none of the ants meet? What would your answer be if there are, instead, k ants sitting on all
k corners of a regular k-sided polygon?
5.26. Robinhood: A biased coin, with probability p of landing on heads, is tossed n times. Write a
recurrence relation for the probability that the total number of heads after n tosses is even.
5.27. Citadel: Alice and Bob are playing a game together. They play a series of rounds until one of
them wins two more rounds than the other. Alice wins a round with probability p. What is the
probability that Bob wins the overall series?
5.28. Google: Say you have three draws of a uniformly distributed random variable between (0, 2).
What is the probability that the median of the three is greater than 1.5?

Hard
5.29. D.E. Shaw: Say you have 150 friends, and 3 of them have phone numbers whose last
four digits are some permutation of the digits 0, 1, 4, and 9. What's the probability of this
occurring?
5.30. Spotify: A fair die is rolled n times. What is the probability that the largest number rolled is r,
for each r in 1,..,6?
5.31. Goldman Sachs: Say you have a jar initially containing a single amoeba in it. Once every
minute, the amoeba has a 1 in 4 chance of doing one of four things: (1) dying out, (2) doing
nothing, (3) splitting into two amoebas, or (4) splitting into three amoebas. What is the
probability that the jar will eventually contain no living amoeba?
5.32. Lyft: A fair coin is tossed n times. Given that there were k heads in the n tosses, what is the
probability that the first toss was heads?
5.33. Quora: You have N i.i.d. draws of numbers following a normal distribution with parameters μ
and σ. What is the probability that k of those draws are larger than some value Y?
5.34. Akuna Capital: You pick three random points on a unit circle and form a triangle from them.
What is the probability that the triangle includes the center of the unit circle?
5.35. Citadel: You have r red balls and w white balls in a bag. You continue to draw balls from the
bag until the bag only contains balls of one color. What is the probability that you run out of
white balls first?

Probability Interview Solutions


Solution #5.1
For the series to go to 7 games, each team must have won exactly three times for the first 6 games,
an occurrence having probability

(6 choose 3) / 2^6 = 20/64 = 5/16


where the numerator is the number of ways of splitting up 3 games won by either side, and the
denominator is the total number of possible outcomes of 6 games.
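Answers like this are easy to sanity-check by simulation after the fact. A minimal Monte Carlo sketch (not part of the original solution):

import random

def series_goes_to_seven():
    wins_a = wins_b = 0
    while wins_a < 4 and wins_b < 4:
        if random.random() < 0.5:
            wins_a += 1
        else:
            wins_b += 1
    return wins_a + wins_b == 7   # the series lasted 7 games

trials = 200_000
estimate = sum(series_goes_to_seven() for _ in range(trials)) / trials
print(estimate)  # should be close to 5/16 = 0.3125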


Solution #5.2
Note that there are only two ways for 6s to be consecutive: either the pair happens on rolls 1 and 2
or 2 and 3, or else all three are 6s. In the first case, the probability is given by

2 · (5/6) · (1/6)² = 10/216
and, for all three rolls being 6s, the probability is
(1/6)³ = 1/216
The desired probability is given by: 10/216 + 1/216 = 11/216

Solution #5.3
First, note that the three rolls must all yield different numbers; otherwise, no strictly increasing order
is possible. The probability that the three numbers will be different is given by the following
reasoning. The first number can be any value from 1 through 6, the second number has a 5/6 chance
of not being the same number as the first, and the third number has a 4/6 chance of not being the
prior two numbers. Thus,
1 · (5/6) · (4/6) = 5/9
Conditioned on there being three different numbers, there is exactly one particular ordering that is
strictly increasing, and this ordering occurs with probability 1/3! = 1/6.
Therefore, the desired probability is given by: (5/9) · (1/6) = 5/54

Solution #5.4
Note that there are a total of (100 choose 2) = 4950 ways to choose two cards at random from the
100. There are exactly 50 pairs that satisfy the condition: (1, 2), …, (50, 100). Therefore, the desired
probability is:
50 / 4950 = 1/99 ≈ 0.01

Solution #5.5
Note that getting to (3, 3, 3) requires 9 moves. Using these 9 moves, it must be the case that there
are exactly three moves in each of the three directions (up, right, and forward). There are 9! ways to
order the 9 moves, but we must divide by 3! for each direction to avoid overcounting, since moves in
the same direction are indistinguishable. Therefore, the number of paths is:

9! / (3! 3! 3!) = 1680

Solution #5.6
Let A denote the event that someone has the disease, and B denote the event that this person tests
positive for the disease. Then we want P(A | B).
By applying Bayes' theorem, we obtain: P(A | B) = P(B | A) P(A) / P(B)
From the problem description, we know that P(B | A) = 0.98 and P(A) = 0.001.
Let A' denote the event that someone does not have the disease. Then, we know that P(B | A') = 0.01.
For the denominator, we have:
P(B) = P(B | A) P(A) + P(B | A') P(A') = 0.98(0.001) + 0.01(0.999)
Therefore, after combining terms, we have the following:
P(A | B) = 0.98 · 0.001 / (0.98 · 0.001 + 0.01 · 0.999) ≈ 8.93%
Solution #5.7
We can use Bayes' theorem here. Let U denote the case where we are flipping the unfair coin and F
denote the case where we are flipping the fair coin. Since the coin is chosen randomly, we know that
P(U) = P(F) = 0.5. Let 5T denote the event of flipping 5 tails in a row. Then, we are interested in
solving for P(U | 5T), i.e., the probability that we are flipping the unfair coin, given that we obtained 5
tails in a row.
We know P(5T | U) = 1, since, by definition, the unfair coin always results in tails. Additionally, we
know that P(5T | F) = 1/2⁵ = 1/32 by definition of a fair coin. By Bayes' theorem, we have:
P(U | 5T) = P(5T | U) P(U) / (P(5T | U) P(U) + P(5T | F) P(F)) = 0.5 / (0.5 + 0.5 · 1/32) ≈ 0.97
Therefore, the probability we picked the unfair coin is about 97%.

Solution #5.8
Let P(A) be the probability that A wins. Then, we know the following to be true:
1. If A flips heads initially, A wins with probability 1.
2. If A flips tails initially, and then B also flips tails, then it is as if neither flip had occurred, and so A wins
with probability P(A).
Combining the two outcomes, we have: P(A) = p + (1 − p)² P(A)
Expanding this yields P(A) = p + P(A) − 2p P(A) + p² P(A), so that p² P(A) − 2p P(A) + p = 0,
and hence: P(A) = 1 / (2 − p)

Solution #5.9


Let R denote the event that it is raining, and Y be a "yes" response when you ask a friend if it is
raining. Then, from Bayes' theorem, we have the following:
P(R | YYY) = P(YYY | R) P(R) / P(YYY)
where the numerator is given by:
P(YYY | R) P(R) = (2/3)³ (1/4) = 2/27
Let R' denote the event of no rain; then the denominator is given by the following:
P(YYY) = P(YYY | R) P(R) + P(YYY | R') P(R') = (2/3)³ (1/4) + (1/3)³ (3/4)
which, when simplified, yields: P(YYY) = 11/108
Combining terms, we obtain the desired probability: P(R | YYY) = (2/27) / (11/108) = 8/11

Solution #5.10
By definition, a chord is a line segment whose two endpoints lie on the circle. Therefore, two
arbitrary chords can always be represented by any four points chosen on the circle. If you choose to
represent the first chord by two of the four points, then you have
(4 choose 2) = 6
ways of choosing the two points to represent chord 1 (and, hence, the other two will represent
chord 2). However, note that this counts each configuration of two chords twice, since choosing, say,
points A and B for chord 1 (leaving C and D for chord 2) yields the same pair of chords as choosing
C and D for chord 1. Therefore, the proper number of distinct pairings is:
(1/2)(4 choose 2) = 3
Among these three configurations, only one has the two chords intersecting; hence, the desired
probability is:
p = 1/3
Solution #5.11
Although there is a formal way to apply Markov chains to this problem, there is a simple trick that
simplifies the problem greatly. Note that, if T is ever flipped, you cannot then reach HH before your
friend reaches TH, since the first heads thereafter will result in them winning. Therefore, the
probability of you winning is limited to just flipping an HH initially, which we know is given by the
following probability:


P(HH) = (1/2)(1/2) = 1/4
Therefore, you have a 1/4 chance of winning, whereas your friend has a 3/4 chance.

Solution #5.12
The price you would be willing to pay is equal to the expectation of the final amount. Note that, for
a single roll, the expectation is
∑_{i=1}^{6} i/6 = 21/6 = 3.5
Therefore, there are two events on which you need to condition. The first is getting a 1, 2, or 3 on the
first roll, in which case you would roll again (since a new roll has an expectation of 3.5, higher than
what you hold), so that, overall, you have an expectation of 3.5. The second is rolling a 4, 5, or 6 on the
first roll, in which case you would keep that roll and end the game, and the expectation would be
5, the average of 4, 5, and 6. Therefore, the expected payoff of the overall game is
(1/2)(3.5) + (1/2)(5) = 4.25
Therefore, you would be willing to pay up to $4.25 to play.
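A short simulation of this optimal stopping rule (keep a first roll of 4, 5, or 6; otherwise roll again) confirms the $4.25 figure; this is a sketch, not part of the original solution:

import random

def play():
    first = random.randint(1, 6)
    # Optimal policy: keep the first roll only if it beats the 3.5 expectation of re-rolling.
    return first if first >= 4 else random.randint(1, 6)

trials = 200_000
print(sum(play() for _ in range(trials)) / trials)  # close to 4.25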

Solution #5.13
Let D denote the case where a rater is diligent, and E the case where a rater is non-diligent. Further,
let 4N denote the case where four pieces of content are labeled as non-spam. We want to solve for
P(D | 4N), and can use Bayes' theorem as follows to do so:
P(D | 4N) = P(4N | D) P(D) / (P(4N | D) P(D) + P(4N | E) P(E))
We are given that P(D) = 0.9 and P(E) = 0.1. Also, we know that P(4N | D) = 0.8 · 0.8 · 0.8 · 0.8 = 0.8⁴ due to
the independence of each of the 4 labels assigned by a diligent rater. Similarly, we know that P(4N | E)
= 1, since a non-diligent rater always labels content as non-spam. Substituting into the equation
above yields the following:
P(D | 4N) = 0.8⁴ · 0.9 / (0.8⁴ · 0.9 + 1⁴ · 0.1) ≈ 0.79
Therefore, the probability that the rater is diligent is about 79%.

Solution #5.14
This is a tricky problem, because your mind probably jumps to the answer of 1/2 because knowing
the gender of one child shouldn't affect the gender of the other. However, the phrase "the second
child is also a boy" implies that we want to know the probability that both children are boys given
that one is a boy. Let B represent a boy and G represent a girl. We then have the following total
sample space representing the possible genders of 2 children: BB, BG, GB, GG.


However, since one child was said to be a boy, the valid sample space is reduced to the following:
BB, BG, GB.
Since all of these options are equally likely, the answer is simply 1/3.

Solution #5.15
Let A denote the event that there is a letter in the 8th drawer, and B denote the event that the first 7
drawers are all empty.
The probability of B occurring can be found by conditioning on whether a letter was placed in the
drawers at all; if so, then each drawer is equally likely to contain the letter, and if not, then none
contains it. Therefore, we have the following:
P(B) = (1/2)(1/8) + (1/2)(1) = 9/16
For A and B to both occur, the letter must be in the 8th drawer, so: P(A ∩ B) = (1/2)(1/8) = 1/16
Therefore, we have: P(A | B) = P(A ∩ B) / P(B) = (1/16) / (9/16) = 1/9

Solution #5.16
We can use a recursive formulation. Let p be the probability that the first player wins. Assume the
score is 0-0 (on a relative basis).
If the first player wins a game (with probability 0.6), then two outcomes are possible: with
probability 0.6 the first player wins, and with probability 0.4 the score is back to 0-0, with p being the
probability of the first player winning overall.
Similarly, if the first player loses a game (with probability 0.4), then with probability 0.6 the score is
back to 0-0 (with p being the probability of the first player winning), or, with probability 0.4, the first
player loses. Therefore, we have: p = 0.6² + 2(0.6)(0.4)p
Solving this yields: p ≈ 0.692
The key idea to solving this and similar problems is that, after two points, either the game is over, or
we're back where we started. We don't need to ever consider the third, fourth, etc., points in an
independent way.

Solution #5.17
The first card will always be a unique color and number, so let's consider the second card. Let A be
the event that the color of card 2 does not match that of card 1, and let B be the event that the
number of card 2 does not match that of card 1. Then, we want to find the following:
P(A ∩ B)
Note that a second card of the same color as card 1 cannot also share its number (each color
contains each number exactly once), and vice versa. Hence, we can compute: P(A ∩ B) = P(A) P(B | A)
For A to occur, there are 40 remaining cards of a color different from that of the first card drawn (and
49 remaining cards altogether). Therefore,


P(A) = 40/49
For B, we know that, of the 40 remaining cards, 36 of them (9 in each color) do not have the same
number as that of card 1.
Therefore, P(B | A) = 36/40
Thus, the desired probability is: P(A ∩ B) = (40/49)(36/40) = 36/49

Solution #5.18
Consider the first nine dice. The sum of those nine dice will be either 0, 1, 2, 3, 4, or 5 modulo 6.
Regardless of that sum, exactly one value for the tenth die will make the sum of all 10 divisible by 6.
For instance, if the sum of the first nine dice is 1 modulo 6, the sum of the first 10 will be divisible by
6 only when the tenth die shows a 5. Thus, the probability is 1/6 for any number of dice, and,
therefore, the answer is simply 1/6.

Solution #5.19
To assess the amount A is willing to pay, we need to calculate the probabilities of winning for each
player, assuming A goes first. Let the probability of A winning (if A goes first) be P(A), and the
probability of B winning (if A goes first but doesn't win on the first roll) be P(B').
Then we can use the following recursive formulation: P(A) = 1/6 + (5/6)(1 − P(B'))
That is, A wins immediately with probability 1/6 (the first roll shows side k); otherwise, with
probability 5/6, A wins only if B, who now effectively goes first, does not win.
However, notice that, if A doesn't roll side k immediately, then P(B') = P(A), since the game is now
exactly symmetric with player B going first.
Therefore, the above can be rewritten as: P(A) = 1/6 + 5/6 − (5/6)P(A)
Solving yields P(A) = 6/11, and P(B) = 1 − P(A) = 5/11. Since the payout is $100, A should be willing
to pay up to the difference in expected values from going first, which is 100 · (6/11 − 5/11) =
100/11, or about $9.09.

Solution #5.20
Let P(H) be the probability of landing on heads and P(T) be the probability of landing on tails for any
given flip, where P(H) + P(T) = 1. Note that it is impossible to generate fair odds using only one flip. If
we use two flips, however, we have four outcomes: HH, HT, TH, and TT. Of these four outcomes, note
that two (HT and TH) have equal probabilities, since P(H) · P(T) = P(T) · P(H). We can disregard HH and TT
and simply repeat the two-flip procedure until HT or TH appears; note that the flips must be taken in
non-overlapping pairs (e.g., the HT inside a sequence HHT does not count).
Therefore, it is possible to generate fair odds by flipping the unfair coin twice and assigning heads to
the HT outcome and tails to the TH outcome.
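This procedure (often called the von Neumann trick) is only a few lines of code. The sketch below simulates it with a coin that is, hypothetically, 70% biased toward heads:

import random

def biased_flip(p_heads=0.7):           # hypothetical bias, chosen only for illustration
    return "H" if random.random() < p_heads else "T"

def fair_flip():
    # Flip twice; HT -> heads, TH -> tails; discard HH and TT and try again.
    while True:
        a, b = biased_flip(), biased_flip()
        if a != b:
            return "H" if (a, b) == ("H", "T") else "T"

trials = 100_000
heads = sum(fair_flip() == "H" for _ in range(trials))
print(heads / trials)  # close to 0.5 regardless of the coin's bias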

Solution #5.21

The only possible candidates for the cube you selected are the following: either it is the inside center
piece (in which case all 6 faces are white) or a face-center piece (which has 5 white faces and one
green face). The former shows 5 white faces in any of its six orientations, while the latter shows 5
white faces in only one particular orientation (green face down). Since all small cubes are chosen
uniformly at random, let A be the event that the bottom face of the cube picked is white, and B be
the event that the other five faces are white.
Note that there is a 1/27 chance that the piece is the center piece and a 6/27 chance that it is a
face-center piece. Therefore, the probability of B happening is given by the following:

P(B) = (1/27)(1) + (6/27)(1/6)
Then, using Bayes' rule:
P(A | B) = P(A ∩ B) / P(B) = (1/27) / [(1/27) + (6/27)(1/6)] = 1/2
Solution #5.22
Assume that the stick looks like the following, with cut points at X and Y:
----------X-----|-----Y-----------
Let M (shown as | above) denote the stick's midpoint at 0.5 of the stick's 1-unit length. Note that, if X
and Y fall on the same side of the midpoint, either both on its left or both on its right, then no triangle is possible,


because, in that case, the length of one of the pieces would be greater than 1/2 (and thus the other
two sides would have a total length strictly less than that of the longest side, making forming a
triangle impossible). The probability that X and Y are on the same side (since the breaks are chosen
uniformly at random) is simply 1/2.
Now, assume that X and Y fall on different sides of the midpoint. If X is further to the left in its half
than Y is in its half, then no triangle is possible in that case either, since the part lying between X and Y
would have a length strictly greater than 0.5 (for example, X at 0.2 and Y at 0.75). This has a 1/2
chance of occurring by a simple symmetry argument, but it is conditional on X and Y being on
different sides of the midpoint, an outcome which itself has a 1/2 chance of occurring. Therefore,
this case occurs with probability 1/4. These two cases represent all cases in which no valid triangle can
be formed; thus, it follows that the probability of a valid triangle being formed equals 1 − 1/2 − 1/4 =
1/4.
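A Monte Carlo sketch of the broken-stick argument (not part of the original solution) converges to the same 1/4:

import random

def forms_triangle():
    x, y = sorted((random.random(), random.random()))   # two uniform break points
    a, b, c = x, y - x, 1 - y                            # the three piece lengths
    # Triangle inequality: every piece must be shorter than the sum of the other two,
    # which for a unit-length stick is equivalent to every piece being shorter than 1/2.
    return max(a, b, c) < 0.5

trials = 200_000
print(sum(forms_triangle() for _ in range(trials)) / trials)  # close to 0.25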

Solution #5.23
Note that both sequences require a heads first, and any sequence of just tails prior to that is
irrelevant to either showing up. Once the first H appears, there are three possibilities. If the next flip
is an H, HHT will inevitably appear first, since the next T will complete that sequence. This has
probability 1/2.
If the next flip is a T, there are two possibilities. If TT appears, then HTT appeared first. This has
probability 1/4. Alternatively, if TH appears, we are back in the initial configuration of having gotten
the first H. Thus, we have:
p = 1/2 + (1/4)p
Solving yields p = 2/3.
Solution #5.24
Let A be the event that the first toss is a heads and B be the event that the second toss is a heads.
Then, for case (a) we are assessing P(A ∩ B | A ∪ B), whereas for case (b) we are assessing P(A ∩ B | B).
For the first case, we have: P(A ∩ B | A ∪ B) = P(A ∩ B) / P(A ∪ B) = (1/4) / (3/4) = 1/3
And, for the second case, we have: P(A ∩ B | B) = P(A ∩ B) / P(B) = (1/4) / (1/2) = 1/2
Therefore, the second case is more likely. For an unfair coin, the conclusion is unchanged, because it
will always be true that P(A ∪ B) > P(B), so the first case will always be less probable than the second
case.

Solution #5.25


Note that the ants are guaranteed to collide unless they each move in the exact same direction. This
only happens when all the ants move clockwise or all move counter-clockwise (picture the triangle
in 2D). Let P(N) denote the probability of no collision, P(C) denote the case where all ants go
clockwise, and P(D) denote the case where all ants go counterclockwise. Since every ant can choose
either direction with equal probability, then we have:

P(N) = P(C) + P(D) = (1/2)³ + (1/2)³ = 1/4

If we extend this reasoning to k ants, the logic is still the same, so we obtain the following:

P(N) = P(C) + P(D) = (1/2)^k + (1/2)^k = 1/2^(k−1)

Solution #5.26
Let A be the event that the total number of heads after n tosses is even, B be the event that the first
toss was tails, and B' be the event that the first toss was heads. By the law of total probability, we
have the following: P(A) = P(A | B) P(B) + P(A | B') P(B')
Letting P_n denote the probability that the number of heads after n tosses is even, we can write the
recurrence relation as follows: P_n = (1 − p) P_{n−1} + p(1 − P_{n−1})

Solution #5.27
Note that, since Alice wins each round with probability p, Bob, by definition, wins each round with
probability 1 − p. Denote 1 − p as q for convenience. Let B_i represent the event that Bob wins i of the
first two rounds, for i = 0, 1, 2. Let B* denote the event that Bob wins the entire series. We can use the
law of total probability as follows:
P(B*) = P(B* | B2) P(B2) + P(B* | B1) P(B1) + P(B* | B0) P(B0)
Since Bob wins each round with probability q, we have: P(B2) = q², P(B1) = 2pq, P(B0) = p²
Substituting these values into the above expression yields: P(B*) = 1 · q² + P(B*) · 2pq + 0 · p²
Hence, the desired probability is: P(B*) = q² / (1 − 2pq)

Solution #5.28
Because the median of three numbers is the middle number, the median is greater than 1.5 only if at
most one of the three draws is less than 1.5 (since the other two must be greater than 1.5). Since each
draw is uniformly distributed on (0, 2), the probability of any one of them being less than 1.5 is given
by the following:
1.5 / 2 = 3/4
Therefore, the chance that at most one is less than 1.5 is given by the sum of the probabilities that
exactly one is less than 1.5 and that none are less than 1.5:
p = (3 choose 1)(3/4)(1/4)² + (1/4)³ = 10/64

Solution #5.29
Let p be the probability that a phone number has the last 4 digits involving only the above given
digits (0, 1, 4 and 9).


We know that the total number of possible last-4-digit combinations is 10⁴ = 10,000, since there are
10 possible digits (0-9). There are 4! = 24 ways to pick a 4-digit permutation of 0, 1, 4, and 9.
Therefore, we have: p = 4!/10000 = 3/1250
Now, since you have 150 friends, the probability of there being exactly 3 with this combination is
given by:
(150 choose 3) p³ (1 − p)^147 ≈ 0.00535

Solution #5.30
Let B_r be the event that all n rolls have a value less than or equal to r. Then we have:
P(B_r) = r^n / 6^n
since each of the n rolls must independently have a value less than or equal to r. Let A_r be the event
that the largest number rolled is exactly r. We have B_r = B_{r−1} ∪ A_r, and, since the two events on the
right-hand side are disjoint, we have the following: P(B_r) = P(B_{r−1}) + P(A_r)
Therefore, the probability of A_r is given by: P(A_r) = P(B_r) − P(B_{r−1}) = r^n / 6^n − (r − 1)^n / 6^n

Solution #5.31
Let p be the probability that the amoeba(s) die out. At any given time step, the probability of dying
out eventually must still be p.
For case (1), the single amoeba dies, so the population dies out with probability 1.
For case (2), the jar still holds one amoeba, so the probability of dying out is p.
For case (3), there are now two amoebas, and each lineage independently dies out with probability p, giving p².
For case (4), each of the three amoebas' lineages dies out with probability p, giving p³.
Putting all four cases together, and noting that the probability of the population eventually dying out
must be the same whether we condition at t = 0 or at t = 1 minute, we have:
p = (1/4)(1 + p + p² + p³)
and solving this yields p = √2 − 1 (discarding the trivial root p = 1).

Solution #5.32
Note that there are (n−1 choose k) ways to choose k heads with the first toss being a T, and a total of
(n choose k)


ways to obtain k heads. So, the probability of having a tails first is given by:
(n−1 choose k) / (n choose k) = (n − k) / n
and, therefore, the probability of obtaining a heads first is given by the following:
1 − (n − k)/n = k/n

Solution #5.33
Let the N draws be denoted X_1, X_2, …, X_N.
We know that, for any given draw i, we have the following (where Φ is the standard normal CDF):
P(X_i > Y) = 1 − P(X_i ≤ Y) = 1 − P((X_i − μ)/σ ≤ (Y − μ)/σ) = 1 − Φ((Y − μ)/σ)



Additionally, the number of draws that are greater than Y follows a binomial distribution with the
value above as its p parameter:
p = 1 − Φ((Y − μ)/σ)
Then, the desired probability that exactly k of the N draws exceed Y is given by: (N choose k) p^k (1 − p)^(N−k)

Solution #5.34
Note that, without loss of generality, the first point can be located at (1, 0). Using the polar
coordinate system, the two other points lie at angles θ and φ, respectively, measured from the first point.
Note that the second point can be placed on either half (top or bottom) without loss of generality.
Therefore, assume that it is on the top half, so that 0 < θ < π.
If the third point is also in the top half, the resulting triangle will not contain the center of the
unit circle. It will also not contain the center if the third point lies at an angle φ > π + θ (try drawing
this out); the triangle contains the center only when π < φ < π + θ.
Therefore, for any given second point, the probability of the third point making the resulting triangle
contain the center of the unit circle is:
p = θ / (2π)
Therefore, the overall probability is given by integrating over the possible values of θ, where the
constant in front averages over θ:
(1/π) ∫_0^π θ/(2π) dθ = 1/4
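The 1/4 answer is also easy to confirm by simulation. A sketch (not part of the original solution) using the fact that an inscribed triangle contains the center exactly when no arc between adjacent points exceeds π:

import math
import random

def contains_center():
    angles = sorted(random.uniform(0, 2 * math.pi) for _ in range(3))
    gaps = [angles[1] - angles[0],
            angles[2] - angles[1],
            2 * math.pi - (angles[2] - angles[0])]
    # The triangle contains the center iff every arc between adjacent points is < pi.
    return max(gaps) < math.pi

trials = 200_000
print(sum(contains_center() for _ in range(trials)) / trials)  # close to 0.25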

Solution #5.35
In order to run out of white balls first, all the white balls must be drawn before the r-th red ball is
drawn. We can consider the first w + r − 1 draws (we know the last ball overall must then be red) and
count how many orderings place the w white balls among them.
The first white ball has w + r − 1 possible positions, the second white ball has w + r − 2, and so on,
until the w-th white ball: (w + r − 1)(w + r − 2)⋯(r), which can be written as a ratio of factorials:
(w + r − 1)! / (r − 1)!
Similarly, there are r! ways to arrange the drawing of the r red balls. We know the total number of
balls is r + w, so there are (r + w)! total arrangements. Therefore, the probability is:
[(w + r − 1)! / (r − 1)!] · r! / (r + w)! = r / (w + r)
A more intuitive way to approach the problem is to consider just the last ball drawn. The probability
that white runs out first is the probability that the last ball is red, which, by symmetry, is simply the
chance that a randomly picked ball is red:
r / (w + r)
Statistics
CHAPTER 6

Statistics is a core component of any data scientist's toolkit. Since many commercial layers
of a data science pipeline are built on statistical foundations (for example, A/B testing),
knowing the foundational topics of statistics is essential.
Interviewers love to test a candidate's knowledge about the basics of statistics, starting with
topics like the Central Limit Theorem and the Law of Large Numbers, and then progressing
to the concepts underlying hypothesis testing, particularly p-values and confidence
intervals, as well as Type I and Type II errors and their interpretations. All of those topics
play an important role in the statistical underpinnings of A/B testing. Additionally, derivations
and manipulations involving random variables of various probability distributions are also
common, particularly in finance interviews. Lastly, a common topic in more technical
interviews involves utilizing MLE and/or MAP.

Topics to Review Before Your Interview


Properties of Random Variables
For any given random variable X, the following properties hold true (below we assume X is
continuous, but it also holds true for discrete random variables).
The expectation (average value, or mean) of a random variable is given by the integral of the value of
X with its probability density function (PDF) f_X(x):




μ = E(X) = ∫_{−∞}^{∞} x f_X(x) dx

and the variance is given by:
Var(X) = E[(X − E[X])²] = E[X²] − (E[X])²
The variance is always non-negative, and its square root is called the standard deviation, which is
heavily used in statistics:
σ = √Var(X) = √(E[(X − E[X])²]) = √(E[X²] − (E[X])²)

The conditional versions of both the expectation and the variance are defined analogously. For
example, consider the conditional expectation of X, given that Y = y:
E[X | Y = y] = ∫_{−∞}^{∞} x f_{X|Y}(x | y) dx

For any given random variables X and Y, the covariance, a linear measure of the relationship between the
two variables, is defined by the following:
Cov(X, Y) = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X] E[Y]
and the normalization of covariance, represented by the Greek letter ρ, is the correlation between X
and Y:
ρ(X, Y) = Cov(X, Y) / √(Var(X) Var(Y))
All of these properties are commonly tested in interviews, so it helps to be able to understand the
mathematical details behind each and walk through an example for each.
For example, if we assume X follows a Uniform distribution on the interval [a, b], then we have the
following:
f_X(x) = 1 / (b − a)
Therefore the expectation of X is:
E(X) = ∫_a^b x f_X(x) dx = ∫_a^b x / (b − a) dx = [x² / (2(b − a))]_a^b = (a + b) / 2

Although it is not necessary to memorize the derivations for all the different probability
distributions, you should be comfortable deriving them as needed, as it is a common request in
more technical interviews. To this end, you should make sure to understand the formulas given
above and be able to apply them to some of the common probability distributions like the
exponential or uniform distribution.
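A quick numerical check of these Uniform(a, b) formulas, as a minimal Monte Carlo sketch (the interval endpoints are arbitrary):

import random

a, b = 2.0, 10.0
samples = [random.uniform(a, b) for _ in range(200_000)]

mean = sum(samples) / len(samples)
var = sum((x - mean) ** 2 for x in samples) / len(samples)

print(mean, (a + b) / 2)       # sample mean vs. (a + b) / 2 = 6
print(var, (b - a) ** 2 / 12)  # sample variance vs. (b - a)^2 / 12 ~= 5.33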

Law of Large Numbers


The Law of Large Numbers (LLN) states that if you sample a random variable independently a large
number of times, the measured average value should converge to the random variable's true
expectation. Stated more formally,
X̄_n = (X_1 + … + X_n) / n → μ,  as n → ∞



This is important in studying the longer-term behavior of random variables over time. As an example,
a coin might land on heads 5 times in a row, but over a much larger n we would expect the proportion
of heads to be approximately half of the total flips. Similarly, a casino might experience a loss on any
individual game, but over the long run should see a predictable profit over time.
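The coin example is easy to see numerically; a minimal sketch printing the running proportion of heads for increasing n:

import random

for n in (10, 100, 10_000, 1_000_000):
    heads = sum(random.random() < 0.5 for _ in range(n))
    print(n, heads / n)   # by the LLN, the proportion of heads approaches 0.5 as n grows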

Central Limit Theorem


The Central Limit Theorem (CLT) states that if you repeatedly sample a random variable a large
number of times, the distribution of the sample mean will approach a normal distribution regardless
of the initial distribution of the random variable.
Recall from the probability chapter that the normal distribution takes on the form:
f_X(x) = (1 / √(2πσ²)) exp(−(x − μ)² / (2σ²))
with the mean and standard deviation given by μ and σ, respectively.
The CLT states that:
X̄_n = (X_1 + … + X_n) / n → N(μ, σ²/n);  hence  (X̄_n − μ) / (σ/√n) → N(0, 1)
The CLT provides the basis for much of hypothesis testing, which is discussed shortly. At a very basic
level, you can consider the implications of this theorem on coin flipping: the probability of getting
some number of heads flipped over a large n should be approximately that of a normal distribution.
Whenever you're asked to reason about any particular distribution over a large sample size, you
should remember to think of the CLT, regardless of whether it is Binomial, Poisson, or any other
distribution.
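A sketch demonstrating the CLT with a decidedly non-normal starting distribution (exponential draws with mean 1; the sample size of 50 is arbitrary): the standardized sample means fall within ±1.96 roughly 95% of the time, as a normal approximation predicts.

import random

n, trials = 50, 20_000        # sample size per mean, number of sample means
mu, sigma = 1.0, 1.0          # mean and standard deviation of Exponential(1)

inside = 0
for _ in range(trials):
    xbar = sum(random.expovariate(1.0) for _ in range(n)) / n
    z = (xbar - mu) / (sigma / n ** 0.5)   # standardize the sample mean
    inside += abs(z) < 1.96

print(inside / trials)   # close to 0.95, even though the underlying draws are exponential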

Hypothesis Testing
General Setup
The process of testing whether or not a sample of data supports a particular hypothesis is called
hypothesis testing. Generally, hypotheses concern particular properties of interest for a given
population, such as its parameters, like μ (for example, the mean conversion rate among a set of
users). The steps in testing a hypothesis are as follows:
1. State a null hypothesis and an alternative hypothesis. Either the null hypothesis will be rejected
(in favor of the alternative hypothesis), or it will fail to be rejected (although failing to reject the
null hypothesis does not necessarily mean it is true, but rather that there is not sufficient
evidence to reject it).
2. Use a particular test statistic of the null hypothesis to calculate the corresponding p-value.
3. Compare the p-value to a certain significance level α.
Since the null hypothesis typically represents a baseline (e.g., the marketing campaign did not
increase conversion rates, etc.), the goal is to reject the null hypothesis with statistical significance
and hope that there is a significant outcome.
Hypothesis tests are either one- or two-tailed tests. A one-tailed test has the following types of null
and alternative hypotheses:
H0: μ = μ0  versus  H1: μ < μ0  or  H1: μ > μ0
whereas a two-tailed test has: H0: μ = μ0  versus  H1: μ ≠ μ0


where H0 is the null hypothesis, H1 is the alternative hypothesis, and μ is the parameter of interest.

Understanding hypothesis testing is the basis of A/B testing, a topic commonly covered in tech
companies' interviews. In A/B testing, various versions of a feature are shown to a sample of
different users, and each variant is tested to determine if there was an uplift in the core engagement
metrics.
Say, for example, that you are working for Uber Eats, which wants to determine whether email
campaigns will increase its product's conversion rates. To conduct an appropriate hypothesis test,
you would need two roughly equal groups (equal with respect to dimensions like age, gender,
location, etc.). One group would receive the email campaigns and the other group would not be
exposed. The null hypothesis in this case would be that the two groups exhibit equal conversion
rates, and the hope is that the null hypothesis would be rejected.

Test Statistics
A test statistic is a numerical summary designed for the purpose of determining whether the null
hypothesis or the alternative hypothesis should be accepted as correct. More specifically, it assumes
that the parameter of interest follows a particular sampling distribution under the null hypothesis.
For example, the number of heads in a series of coin flips may be distributed as a binomial
distribution, but with a large enough sample size, the sampling distribution should be approximately
normally distributed. Hence, the sampling distribution for the total number of heads in a large series
of coin flips would be considered normally distributed.
Several variations of test statistics and their distributions include:
1. Z-test: assumes the test statistic follows a normal distribution under the null hypothesis
2. t-test: uses a Student's t-distribution rather than a normal distribution
3. Chi-squared test: used to assess goodness of fit, and to check whether two categorical variables are
independent

Z-Test
Generally the Z-test is used when the sample size is large (to invoke the CLT) or when the population
variance is known, and a t-test is used when the sample size is small and when the population
variance is unknown. The Z-test for a population mean is formulated as:
z = (x̄ − μ0) / (σ/√n) ~ N(0, 1)
in the case where the population variance σ² is known.
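For concreteness, a minimal sketch of this Z-test in Python; the sample mean, hypothesized mean, standard deviation, and sample size below are made up, and the standard normal CDF is built from math.erf:

import math

def normal_cdf(z):
    # Standard normal CDF via the error function.
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Hypothetical data: sample mean 5.2 from n = 100 observations,
# testing H0: mu = 5.0 with known population standard deviation 1.0.
xbar, mu0, sigma, n = 5.2, 5.0, 1.0, 100

z = (xbar - mu0) / (sigma / math.sqrt(n))
p_value = 2 * (1 - normal_cdf(abs(z)))   # two-tailed p-value

print(z, p_value)   # z = 2.0, p ~= 0.046, so reject H0 at alpha = 0.05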

t-Test
The t-test is structured similarly to the Z-test but uses the sample variance s² in place of the population
variance. The t-test is parametrized by the degrees of freedom, which refers to the number of
independent observations in the dataset, denoted below by n − 1:
t = (x̄ − μ0) / (s/√n) ~ t_{n−1}


where
s² = (1 / (n − 1)) ∑_{i=1}^{n} (x_i − x̄)²

As stated earlier, the t-distribution is similar to the normal distribution in appearance but has heavier
tails (i.e., extreme events happen with greater frequency than a normal model would predict), a
common phenomenon, particularly in economics and Earth sciences.

Chi-Squared Test
The Chi-squared test statistic is used to assess goodness of fit and is calculated as follows:
χ² = ∑_i (O_i − E_i)² / E_i

where O_i is the observed count of interest and E_i is its expected count. A Chi-squared test statistic
has a particular number of degrees of freedom, which is based on the number of categories in
the distribution.
To use the Chi-squared test to check whether two categorical variables are independent, create a table
of counts (called a contingency table), with the values of one variable forming the rows of the table
and the values of the other variable forming its columns, and compare the observed count in each
cell to the count expected under independence. This comparison uses the same style of Chi-squared
test statistic as given above.
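Computing the statistic itself is a one-liner; a sketch with made-up observed counts, testing a fair six-sided die against equal expected counts:

# Hypothetical observed counts from 120 rolls of a die; 20 per face expected if fair.
observed = [18, 24, 16, 22, 25, 15]
expected = [20] * 6

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(chi_sq)  # compare against a chi-squared distribution with 6 - 1 = 5 degrees of freedom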

Hypothesis Testing for Population Proportions


Note that, due to the CLT, the Z-test can be applied to random variables of any distribution. For
example, when estimating the sample proportion of a population having a characteristic of interest,
we can view the members of the population as Bernoulli random variables, with those having the
characteristic represented by 1s and those lacking it represented by 0s. Viewing the sample
proportion of interest as the sum of these Bernoulli random variables divided by the total sample
size, we can then compute the sample mean and variance of the overall proportion, about which we
can form the following set of hypotheses:
H0: p = p0  versus  H1: p ≠ p0
and the corresponding test statistic to conduct a Z-test would be:
z = (p̂ − p0) / √(p0(1 − p0)/n)
In practice, these test statistics form the core of A/B testing. For instance, consider the previously
discussed case, in which we seek to measure conversion rates within groups A and B, where A is the
control group and B has the special treatment (in this case, a marketing campaign). Adopting the
same null hypothesis as before, we can proceed to use a Z-test to assess the difference in empirical
population means (in this case, conversion rates) and test its statistical significance at a
predetermined level.
When asked about A/B testing or related topics, you should always cite the relevant test statistic and
the reason for its validity (usually the CLT).

p-values and Confidence Intervals


Both p-values and confidence intervals are commonly covered topics during interviews. Put simply, a
p-value is the probability, under the null hypothesis, of observing a test statistic at least as extreme as
the one actually calculated. Usually, the p-value is assessed relative to some predetermined level of
significance (0.05 is often chosen).

In conducting a hypothesis test, an α, a measure of the acceptable probability of rejecting a true
null hypothesis, is typically chosen prior to conducting the test. Then, a confidence interval can also
be calculated to assess the test statistic. This is a range of values that, over repeated samples, would
contain the parameter value of interest (1 − α) · 100% of the time. For instance, a 95% confidence
interval would contain the true value 95% of the time. If the value specified by the null hypothesis
(for example, 0 for a difference in means) is included in the confidence interval, then we cannot
reject the null hypothesis (and vice versa).
The general form for a confidence interval around the population mean looks like the following,
where the z_{α/2} term is the critical value (for the standard normal distribution):
x̄ ± z_{α/2} · σ/√n
In the prior example with the A/B testing on conversion rates, the confidence interval for
a population proportion would be
p̂ ± z_{α/2} √(p̂(1 − p̂)/n)
since our estimate of the true proportion, when approximated as Gaussian, has the following parameters:
μ = np/n = p,  σ² = np(1 − p)/n² = p(1 − p)/n
As long as the sampling distribution of a random variable is known, the appropriate p-values and
confidence intervals can be assessed.
Knowing how to explain p-values and confidence intervals, in both technical and nontechnical terms, is
very useful during interviews, so be sure to practice these. If asked about the technical details,
always remember to make sure you correctly identify the mean and variance at hand.
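As an illustration, here is a small Python sketch (NumPy and SciPy, with simulated data) that computes a 95% confidence interval for a mean using the formula above; the distribution, sample size, and seed are hypothetical choices.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=10, scale=2, size=500)   # made-up sample for illustration

alpha = 0.05
z_crit = stats.norm.ppf(1 - alpha / 2)           # ~1.96 for a 95% interval

mean = sample.mean()
se = sample.std(ddof=1) / np.sqrt(len(sample))   # standard error of the mean

ci_lower, ci_upper = mean - z_crit * se, mean + z_crit * se
print(f"95% CI for the mean: ({ci_lower:.3f}, {ci_upper:.3f})")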

Type I and II Errors


There are two errors that are frequently assessed: Type I error, which is also known as a "false
positive," and Type II error, which is also known as a "false negative." Specifically, a Type I error is
when one rejects the null hypothesis when it is correct, and a Type II error is when the null
hypothesis is not rejected when it is incorrect.
Usually 1 − α is referred to as the confidence level, whereas 1 − β is referred to as the power. If you plot
sample size versus power, generally you should see a larger sample size corresponding to a larger
power. It can be useful to look at power in order to gauge the sample size needed for detecting a
significant effect. Generally, tests are set up in such a way as to have both 1 − α and 1 − β relatively high
(say at 0.95 and 0.8, respectively).
In testing multiple hypotheses, it is possible that if you ran many experiments — even if a particular
outcome for one experiment is very unlikely — you would see a statistically significant outcome at
least once. So, for example, if you set α = 0.05 and run 100 hypothesis tests, then by pure chance you
would expect 5 of the tests to be statistically significant. However, a more desirable outcome is to
have the overall α of the 100 tests be 0.05, and this can be done by setting the new α to α/n, where
n is the number of hypothesis tests (in this case, α/n = 0.05/100 = 0.0005). This is known as the
Bonferroni correction, and using it helps make sure that the overall rate of false positives is
controlled within a multiple testing framework.
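A minimal sketch of applying the Bonferroni correction in Python, assuming a hypothetical list of p-values from independent tests:

# Given p-values from multiple hypothesis tests, compare each against alpha / n
# rather than alpha. The p-values below are made up for illustration.
alpha = 0.05
p_values = [0.001, 0.020, 0.049, 0.300, 0.800]

n_tests = len(p_values)
bonferroni_alpha = alpha / n_tests

rejections = [p < bonferroni_alpha for p in p_values]
print(f"Corrected threshold: {bonferroni_alpha:.4f}")
print(f"Rejected nulls: {sum(rejections)} of {n_tests}")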


Generally, most interview questions concerning Type I and II errors are qualitative in nature — for
instance, requesting explanations of terms or of how you would go about assessing errors/power in
an experimental setup.

MLE and MAP


Any probability distribution has parameters, so fitting parameters is an extremely crucial part of data
analysis. There are two general methods for doing so. In maximum likelihood estimation (MLE), the
goal is to estimate the most likely parameters given a likelihood function: $\hat{\theta}_{MLE} = \arg\max_{\theta} L(\theta)$, where
$L(\theta) = f_n(x_1, \dots, x_n \mid \theta)$
Since the values of X are assumed to be i.i.d., the likelihood function becomes the following:
$L(\theta) = \prod_{i=1}^{n} f(x_i \mid \theta)$
The natural log of $L(\theta)$ is then taken prior to calculating the maximum; since log is a monotonically
increasing function, maximizing the log-likelihood $\log L(\theta)$ is equivalent to maximizing the likelihood:
$\log L(\theta) = \sum_{i=1}^{n} \log f(x_i \mid \theta)$
Another way of fitting parameters is through maximum a posteriori estimation (MAP), which
assumes a "prior distribution":
$\hat{\theta}_{MAP} = \arg\max_{\theta}\, g(\theta)\, f(x_1, \dots, x_n \mid \theta)$
where the similar log-likelihood is again employed, and $g(\theta)$ is a prior density function of $\theta$.
Both MLE and MAP are especially relevant in statistics and machine learning, and knowing them is
recommended, especially for more technical interviews. For instance, a common question in such
interviews is to derive the MLE for a particular probability distribution. Thus, understanding the
above steps, along with the details of the relevant probability distributions, is crucial.
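As one hedged illustration of the MLE recipe above, the sketch below fits the rate of an exponential distribution in Python by numerically maximizing the log-likelihood (via SciPy) and compares it against the closed-form answer n / Σx; the simulated data and the choice of distribution are assumptions for illustration only.

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
data = rng.exponential(scale=1 / 2.5, size=1000)   # simulated data, true rate lambda = 2.5

# Negative log-likelihood for Exponential(lambda): -(n*log(lam) - lam*sum(x))
def neg_log_likelihood(lam):
    return -(len(data) * np.log(lam) - lam * data.sum())

result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 100), method="bounded")
print(f"Numerical MLE:            {result.x:.3f}")
print(f"Closed-form MLE n/sum(x): {len(data) / data.sum():.3f}")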

40 Real Statistics Interview Questions


Easy
6.1. Uber: Explain the Central Limit Theorem. Why is it useful?
6.2. Facebook: How would you explain a confidence interval to a non-technical audience?
6.3. Twitter: What are some common pitfalls encountered in A/B testing?
6.4. Lyft: Explain both covariance and correlation formulaically, and compare and contrast them.
6.5. Facebook: Say you flip a coin 10 times and observe only one heads. What would be your null
hypothesis and p-value for testing whether the coin is fair or not?
6.6. Uber: Describe hypothesis testing and p-values in layman's terms?
6.7. Groupon: Describe what Type I and Type II Errors are, and the trade-offs between them.
6.8. Microsoft: Explain the statistical background behind power.
6.9. Facebook: What is a Z-test and when would you use it versus a t-test?
6.10. Amazon: Say you are testing hundreds of hypotheses, each with a t-test. What considerations
would you take into account when doing this?

Medium


6.11. Google: How would you derive a confidence interval for the probability of flipping heads
from a series of coin tosses?
6.12. Two Sigma: What is the expected number of coin flips needed to get two consecutive
heads?
6.13. Citadel: What is the expected number of rolls needed to see all six sides of a fair die?
6.14. Akuna Capital: Say you're rolling a fair six-sided die. What is the expected number of rolls
until you roll two consecutive 5s?
6.15. D.E. Shaw: A coin was flipped 1,000 times, and 550 times it showed heads. Do you think the
coin is biased? Why or why not?
6.16. Quora: You are drawing from a normally distributed random variable X ~ N(0, 1) once a day.
What is the approximate expected number of days until you get a value greater than 2?
6.17. Akuna Capital: Say you have two random variables X and Y, each with a standard deviation.
What is the variance of aX + bY for constants a and b?
6.18. Google: Say we have X ~ Uniform(0, 1) and Y ~ Uniform(0, 1) and the two are independent.
What is the expected value of the minimum of X and Y?
6.19. Morgan Stanley: Say you have an unfair coin which lands on heads 60% of the time. How
many coin flips are needed to detect that the coin is unfair?
6.20. Uber: Say you have n numbers 1...n, and you uniformly sample from this distribution with
replacement n times. What is the expected number of distinct values you would draw?
6.21. Goldman Sachs: There are 100 noodles in a bowl. At each step, you randomly select two
noodle ends from the bowl and tie them together. What is the expectation on the number
of loops formed?
6.22. Morgan Stanley: What is the expected value of the max of two dice rolls?
6.23. Lyft: Derive the mean and variance of the uniform distribution U(a, b).
6.24. Citadel: How many cards would you expect to draw from a standard deck before seeing the
first ace?
6.25. Spotify: Say you draw n samples from a uniform distribution U(a, b). What are the MLE
estimates of a and b?

Hard
6.26. Google: Assume you are drawing from an infinite set of i.i.d random variables that are
uniformly distributed from (0, 1). You keep drawing as long as the sequence you are getting
is monotonically increasing. What is the expected length of the sequence you draw?
6.27. Facebook: There are two games involving dice that you can play. In the first game, you roll
two dice at once and receive a dollar amount equivalent to the product of the rolls. In the
second game, you roll one die and get the dollar amount equivalent to the square of that
value. Which has the higher expected value and why?


6.28. Google: What does it mean for an estimator to be unbiased? What about consistent? Give
examples of an unbiased but not consistent estimator, and a biased but consistent
estimator.
6.29. Netflix: What are MLE and MAP? What is the difference between the two?
6.30. Uber: Say you are given a random Bernoulli trial generator. How would you generate values
from a standard normal distribution?
6.31. Facebook: Derive the expectation for a geometric random variable.
6.32. Goldman Sachs: Say we have a random variable X ~ D, where D is an arbitrary distribution.
What is the distribution F(X) where F is the CDF of X?
6.33. Morgan Stanley: Describe what a moment generating function (MGF) is. Derive the MGF for
a normally distributed random variable X.
6.34. Tesla: Say you have N independent and identically distributed draws of an exponential
random variable. What is the best estimator for the parameter λ?
6.35. Citadel: Assume that log X ~ N(0, 1). What is the expectation of X?
6.36. Google: Say you have two distinct subsets of a dataset for which you know their means and
standard deviations. How do you calculate the blended mean and standard deviation of the
total dataset? Can you extend it to K subsets?
6.37. Two Sigma: Say we have two random variables X and Y. What does it mean for X and Y to be
independent? What about uncorrelated? Give an example where X and Y are uncorrelated
but not independent.
6.38. Citadel: Say we have X ~ Uniform(-1, 1) and Y = X^2. What is the covariance of X and Y?
6.39. Lyft: How do you uniformly sample points at random from a circle with radius R?
6.40. Two Sigma: Say you continually sample from some i.i.d. uniformly distributed (0, 1) random
variables until the sum of the variables exceeds 1. How many samples do you expect to
make?

40 Real Statistics Interview Solutions


Solution #6.1
The Central Limit Theorem (CLT) states that if any random variable, regardless of distribution, is
sampled a large enough number of times, the sample mean will be approximately normally
distributed. This allows for studying the properties of any statistical distribution as long as there is
a large enough sample size.
The mathematical definition of the CLT is as follows: for any given random variable X, as n
approaches infinity,
$\bar{X}_n = \frac{X_1 + \dots + X_n}{n} \rightarrow N\left(\mu, \frac{\sigma^2}{n}\right)$
At any company with a lot of data, like Uber, this concept is core to the various experimentation
platforms used in the product. For a real-world example, consider testing whether adding a new
feature increases rides booked in the Uber platform, where each X is an individual ride and is a
Bernoulli random variable (i.e., the rider books or does not book a ride). Then, if the sample size is
sufficiently large, we can assess the statistical properties of the total number of bookings, as well as the
the booking rate (rides booked / rides opened on app). These statistical properties play a key role in
hypothesis testing, allowing companies like Uber to decide whether or not to add new features in a
data-driven manner.
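A quick simulation sketch of the CLT in Python, under assumed (made-up) values for the booking probability and per-experiment sample size: the distribution of per-experiment booking rates should be approximately normal with mean p and variance p(1 − p)/n.

import numpy as np

rng = np.random.default_rng(42)
p = 0.3                 # hypothetical booking probability
n = 1000                # riders per experiment
n_experiments = 5000

# Each experiment: the mean of n Bernoulli(p) outcomes (the empirical booking rate)
booking_rates = rng.binomial(1, p, size=(n_experiments, n)).mean(axis=1)

# By the CLT, the sample means should be approximately N(p, p(1-p)/n)
print(f"Mean of sample means: {booking_rates.mean():.4f} (theory: {p})")
print(f"Std of sample means:  {booking_rates.std():.5f} "
      f"(theory: {np.sqrt(p * (1 - p) / n):.5f})")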

Solution #6.2
Suppose we want to estimate some parameters of a population. For example, we might want to
estimate the average height of males in the U.S. Given some data from a sample, we can compute a
sample mean for what we think the value is, as well as a range of values around that mean.
Following the previous example, we could obtain the heights of 1,000 random males in the U.S. and
compute the average height, or the sample mean. This sample mean is a type of point estimate and,
while useful, will vary from sample to sample. Thus, we can't tell anything about the variation in the
data around this estimate, which is why we need a range of values through a confidence interval.
Confidence intervals are a range of values with a lower and an upper bound such that if you were to
sample the parameter of interest a large number of times, the 95% confidence interval would
contain the true value of this parameter 95% of the time. We can construct a confidence interval
using the sample standard deviation and sample mean. The width of the interval is determined by the
confidence level chosen beforehand, together with the sample size and variability, which together set the
margin of error. The narrower the confidence interval, the more precise the
estimate, since there is less uncertainty associated with the point estimate of the mean.

Solution #6.3
A/B testing has many possible pitfalls that depend on the particular experiment and setup
employed. One common drawback is that groups may not be balanced, possibly resulting in highly
skewed results. Note that balance is needed for all dimensions of the groups — like user
demographics or device used — because, otherwise, the potentially statistically significant results
from the test may simply be due to specific factors that were not controlled for. Two types of errors
are frequently assessed: Type I error, which is also known as a "false positive," and Type II error, also
known as a "false negative." Specifically, Type error is rejecting a null hypothesis when that
hypothesis is correct, whereas Type II error is failing to reject a null hypothesis when its alternative
hypothesis is correct.
Another common pitfall is not running an experiment for long enough. Generally speaking,
experiments are run with a particular power threshold and significance threshold, and they
should not be stopped immediately upon detecting an effect. For an extreme example, assume you're at
either Uber or Lyft and running a test for only two days, when the metric of interest (e.g., rides booked) is
subject to weekly seasonality.
Lastly, dealing with multiple tests is important because there may be interactions between results of
tests you are running and so attributing results may be difficult. In addition, as the number of
variations you run increases, so does the sample size needed. In practice, while it seems technically
feasible to test 1,000 variations of a button when optimizing for click-through rate, variations in tests
are usually based on some intuitive hypothesis concerning core behavior.

Solution #6.4
For any given random variables X and Y, the covariance, a linear measure of relationship, is defined
by the following:
$\mathrm{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y]$


Specifically, covariance indicates the direction of the linear relationship between X and Y and can
take on any potential value from negative infinity to infinity. The units of covariance are based on the
units of X and Y, which may differ.


The correlation between X and Y is the normalized version of covariance that takes into account the
variances of X and Y:
$\rho(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}$
Since correlation results from scaling covariance, it is dimensionless (unlike covariance) and is always
between -1 and 1 (also unlike covariance).

Solution #6.5
The null hypothesis is that the coin is fair, and the alternative hypothesis is that the coin is biased
towards tails (note this is a one-sided test):
$H_0: p = 0.5 \quad \text{versus} \quad H_1: p < 0.5$
Note that, since the sample size here is 10, you cannot apply the Central Limit Theorem (and so you
cannot approximate a binomial using a normal distribution).
The p-value here is the probability of observing results at least as extreme as those obtained given that the null
hypothesis is true, i.e., under the assumption that the coin is fair. In total, for 10 flips of a coin there are 2^10 =
1024 possible outcomes, and in only 10 of them are there 9 tails and one heads. The probability of exactly one
heads is therefore 10/1024 ≈ 0.0098 (including the even more extreme outcome of zero heads gives a p-value of
11/1024 ≈ 0.011). Therefore, with a significance level set, for example, at 0.05, we can reject the null hypothesis.
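For reference, the exact binomial probabilities above can be checked in Python with SciPy (a quick sanity check, not part of the original solution):

from scipy import stats

# Exact probabilities for heads out of 10 flips of a fair coin
p_exactly_one_head = stats.binom.pmf(1, n=10, p=0.5)   # the 10/1024 term from above
p_at_most_one_head = stats.binom.cdf(1, n=10, p=0.5)   # includes the 0-heads outcome

print(f"P(exactly 1 head) = {p_exactly_one_head:.4f}")  # ~0.0098
print(f"P(at most 1 head) = {p_at_most_one_head:.4f}")  # ~0.0107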

Solution #6.6
The process of testing whether data supports particular hypotheses is called hypothesis testing and
involves measuring parameters of a population's probability distribution. This process typically
employs at least two groups: one a control group that receives no treatment, and the other group(s), which
receive the treatment(s) of interest. Examples could be the height of two groups of people, the
conversion rates for particular user flows in a product, etc. Testing also involves two hypotheses —
the null hypothesis, which assumes no significant difference between the groups, and the alternative
hypothesis, which assumes a significant difference in the measured parameter(s) as a consequence
of the treatment.
A p-value is the probability of observing test results at least as extreme as those given under the null hypothesis
assumptions. The lower this probability, the higher the chance that the null hypothesis should be
rejected. If the p-value is lower than the predetermined significance level α, generally set at 0.05,
then it indicates that the null hypothesis should be rejected in favor of the alternative hypothesis.
Otherwise, the null hypothesis cannot be rejected, and it cannot be concluded that the treatment
has any significant effect.
Solution #6.7
Both errors are relevant in the context of hypothesis testing. Type I error is when one rejects the null
hypothesis when it is correct, and is known as a false positive. Type II error is when the null
hypothesis is not rejected when the alternative hypothesis is correct; this is known as a false
negative. In layman's terms, a Type I error is when we detect a difference when in reality there is no
significant difference in an experiment. Similarly, a Type II error occurs when we fail to detect a
difference when in reality there is a significant difference in an experiment.
The Type I error rate is given by the level of significance α, whereas the Type II error rate is given by β. Usually,
1 − α is referred to as the confidence level, whereas 1 − β is referred to as the statistical power of the test
being conducted. Note that, in any well-conducted statistical procedure, we want to have both α and


β be small. However, based on the definitions of the two, it is impossible to make both errors small
simultaneously: the larger α is, the smaller β is. Based on the experiment and the relative
importance of false positives and false negatives, a data scientist must decide what thresholds to
adopt for any given experiment. Note that experiments are set up so as to have both 1 − α and 1 − β
relatively high (say at 0.95 and 0.8, respectively).

Solution #6.8
Power is the probability of rejecting the null hypothesis when, in fact, it is false. It is also the
probability of avoiding a Type II error. A Type II error occurs when the null hypothesis is not rejected
when the alternative hypothesis is correct. This is important because we want to detect significant
effects during experiments. That is, the higher the statistical power of the test, the higher the
probability of detecting a genuine effect (i.e., accepting the alternative hypothesis and rejecting the
null hypothesis). A minimum sample size can be calculated for any given level of power — for
example, say a power level of 0.8.
An analysis of the statistical power of a test is usually performed with respect to the test's level of
significance (α) and effect size (i.e., the magnitude of the results).

Solution #6.9
In a Z-test, your test statistic follows a normal distribution under the null hypothesis. Alternatively, in
a t-test, you employ a Student's t-distribution rather than a normal distribution as your sampling
distribution.
Considering the population mean, we can use either a Z-test or a t-test only if the mean is normally
distributed, which is possible in two cases: the initial population is normally distributed, or the
sample size is large enough (n ≥ 30) that we can apply the Central Limit Theorem.
If the condition above is satisfied, then we need to decide which type of test is more appropriate to
use. In general, we use a Z-test if the population variance is known, and a t-test if the population
variance is unknown.
Additionally, if the sample size is very large (n > 200), we can use the Z-test in any case, since for such
large degrees of freedom the t-distribution essentially coincides with the z-distribution.
Considering the population proportion, we can use a Z-test (but not a t-test) when np_0 ≥ 10 and
n(1 − p_0) ≥ 10, i.e., when each of the number of successes and the number of failures is at least 10.
Solution #6.10
The primary consideration is that, as the number of tests increases, the chance that a stand-alone p-
value for any of the t-tests is statistically significant becomes very high due to chance alone. As an
example, with 100 tests performed and a significance threshold of α = 0.05, you would expect five of
the experiments to be statistically significant due only to chance. That is, you have a very high probability
of observing at least one significant outcome. Therefore, the chance of incorrectly rejecting a null
hypothesis (i.e., committing a Type I error) increases.
To correct for this effect, we can use a method called the Bonferroni correction, wherein we set the
significance threshold to α/m, where m is the number of tests being performed. In the above
scenario with 100 tests, we would instead set the significance threshold to 0.05/100 = 0.0005.
While this correction helps to protect from Type I error, it is still prone to Type II error (i.e., failing to
reject the null hypothesis when it should be rejected). In general, the Bonferroni correction is mostly
useful when there is a smaller number of multiple comparisons of which a few are significant. If the


number of tests becomes sufficiently high such that many tests yield statistically significant results,
the number of Type II errors may also increase significantly.

Solution #6.11
The confidence interval (CI) for a population proportion is an interval that includes the true population
proportion with a certain degree of confidence 1 − α.
For the case of flipping heads in a series of coin tosses, the proportion follows a binomial
distribution. If the series size is large enough (each of the number of successes and the number of
failures is at least 10), we can utilize the Central Limit Theorem and use the normal approximation
for the binomial distribution as follows:
$\hat{p} \approx N\left(\hat{p}, \frac{\hat{p}(1 - \hat{p})}{n}\right)$
where $\hat{p}$ is the proportion of heads tossed in the series, and n is the series size. The CI is centered at the
sample proportion, plus or minus a margin of error:
$\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$
where $z_{\alpha/2}$ is the appropriate critical value from the standard normal distribution for the desired confidence
level.
For example, for the most commonly used level of confidence, 95%, $z_{\alpha/2} = 1.96$.

Solution #6.12
Let X be the number of coin flips needed to obtain two consecutive heads. We then want to solve for
E[X]. Let H denote a flip that results in heads, and T denote a flip that results in tails. Note that E[X]
can be written in terms of E[X|H] and E[X|T], i.e., the expected number of flips needed, conditioned
on the first flip being either heads or tails, respectively.
Conditioning on the first flip, we have:
$E[X] = \frac{1}{2}(1 + E[X \mid H]) + \frac{1}{2}(1 + E[X \mid T])$
Note that E[X|T] = E[X], since if a tail is flipped, we need to start over in getting two heads in a row.
To solve for E[X|H], we can condition it further on the next outcome: either heads (HH) or tails (HT).
Therefore, we have:
$E[X \mid H] = \frac{1}{2}(1 + E[X \mid HH]) + \frac{1}{2}(1 + E[X \mid HT])$
Note that if the result is HH, then E[X|HH] = 0, since the outcome has been achieved. If a tail was
flipped, then E[X|HT] = E[X], and we need to start over in attempting to get two heads in a row. Thus:
$E[X \mid H] = \frac{1}{2}(1 + 0) + \frac{1}{2}(1 + E[X]) = 1 + \frac{1}{2}E[X]$
Plugging this into the original equation yields:
$E[X] = \frac{1}{2}\left(1 + 1 + \frac{1}{2}E[X]\right) + \frac{1}{2}(1 + E[X])$
and after solving we get E[X] = 6. Therefore, we would expect 6 flips.

Solution #6.13
Let k denote the number of distinct sides seen from rolls. The first roll will always result in a new side
being seen. If you have seen k sides, where k < 6, then the probability of rolling an unseen value will
be (6 − k)/6, since there are 6 − k values you have not seen and 6 possible outcomes of each roll.


Note that each roll is independent of previous rolls. Therefore, for the second roll (k = 1), the time
until a side not yet seen appears has a geometric distribution with p = 5/6, since there are five of the six
sides left to be seen. Likewise, after two sides (k = 2), the time taken follows a geometric distribution with
p = 4/6. This continues until all sides have been seen.
Recall that the mean of a geometric distribution is given by 1/p, and let X be the number of rolls
needed to show all six sides. Then, we have the following:
$E[X] = 1 + \frac{6}{5} + \frac{6}{4} + \frac{6}{3} + \frac{6}{2} + \frac{6}{1} = 6\sum_{k=1}^{6}\frac{1}{k} \approx 14.7 \text{ rolls}$
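A short Monte Carlo sketch in Python (a sanity check, not part of the derivation) that should land near the 14.7 figure:

import numpy as np

rng = np.random.default_rng(0)

def rolls_until_all_sides(n_sides=6):
    """Simulate rolling a fair die until every side has appeared at least once."""
    seen, rolls = set(), 0
    while len(seen) < n_sides:
        seen.add(rng.integers(1, n_sides + 1))
        rolls += 1
    return rolls

trials = [rolls_until_all_sides() for _ in range(50_000)]
print(f"Simulated expectation: {np.mean(trials):.2f} (theory: ~14.7)")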

Solution #6.14
Similar in methodology to question 13, let X be the number of rolls until two consecutive fives. Let Y
denote the event that a five was just rolled.
Conditioning on Y, we know that either we just rolled a five, so we only have one more five to roll, or
we rolled some other number and now need to start over after having rolled once:
$E[X] = \frac{1}{6}(1 + E[X \mid Y]) + \frac{5}{6}(1 + E[X])$
Note that we have the following:
$E[X \mid Y] = \frac{1}{6}(1) + \frac{5}{6}(1 + E[X])$
Plugging the results in yields an expected value of 42 rolls: E [ X ] = 42

Solution #6.15
Because the sample size of flips is large (1,000), we can apply the Central Limit Theorem. Since each
individual flip is a Bernoulli random variable, let p be the probability of getting heads. We want to test
whether p = 0.5 (i.e., whether it is a fair coin or not). The Central Limit Theorem allows us to approximate
the total number of heads seen as being normally distributed. More specifically, the number of heads seen
out of n total flips follows a binomial distribution, since it is a sum of Bernoulli random variables. If the coin
is not biased (p = 0.5), then the expected number of heads is $\mu = np = 1000 \times 0.5 = 500$, and the
variance of the number of heads is given by:
$\sigma^2 = np(1 - p) = 1000 \times 0.5 \times 0.5 = 250, \qquad \sigma = \sqrt{250} \approx 16$

Since this mean and standard deviation specify the normal distribution, we can calculate the
corresponding z-score for 550 heads as follows:
$z = \frac{550 - 500}{16} \approx 3.16$
This means that, if the coin were fair, the event of seeing 550 or more heads would occur with less than a
0.1% chance under normality assumptions. Therefore, the coin is likely biased.
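A minimal Python sketch of the same calculation, including the corresponding one-sided tail probability from SciPy:

import numpy as np
from scipy import stats

n, heads, p0 = 1000, 550, 0.5

mu = n * p0
sigma = np.sqrt(n * p0 * (1 - p0))     # ~15.8

z = (heads - mu) / sigma               # ~3.16
p_value = stats.norm.sf(z)             # one-sided P(Z >= z)

print(f"z = {z:.2f}, one-sided p-value = {p_value:.5f}")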

Solution #6.16
Since X is normally distributed, we can employ the cumulative distribution function (CDF) of the
standard normal distribution: $\Phi(2) = P(X \leq 2) = P(X \leq \mu + 2\sigma) \approx 0.9772$
Therefore, P(X > 2) = 1 − 0.9772 ≈ 0.023 for any given day. Since each day's draws are independent, the
expected time until drawing an X > 2 follows a geometric distribution with p ≈ 0.023. Letting T be a
random variable denoting the number of days, we have the following:


$E[T] = \frac{1}{p} = \frac{1}{0.0228} \approx 44 \text{ days}$
Solution #6.17
Let the variances of X and Y be denoted by Var(X) and Var(Y).
Then, recall that the variance of a sum of random variables is expressed as follows:
$\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X, Y)$
and that a constant coefficient on a random variable is handled as follows: $\mathrm{Var}(aX) = a^2\mathrm{Var}(X)$
We therefore have $\mathrm{Var}(aX + bY) = a^2\mathrm{Var}(X) + b^2\mathrm{Var}(Y) + 2ab\,\mathrm{Cov}(X, Y)$, which
provides the bounds on the designated variance; the exact value will depend on the covariance between X and Y.

Solution #6.18
Let Z = min(X,Y). Then we know the following: P ( Z ≤ z ) =P ( min ( X ,Y ) ≤ z )=1− p( X > z ,Y > z )
For a uniform distribution, the following is true for a value of z between 0 and 1:
P ( X > z )=1−z and P ( Y > z )=1−z
Since X and Y are i.i.d., this yields: P ( Z ≤ z ) =1−P ( X > z ,Y > z ) =1−(1−z)2
Now we have the cumulative distribution function for z. We can get the probability density function
by taking the derivative of the CDF to obtain the following: fz(z )=2(1−z ). Then, solving for the
expected value by taking the integral yields the following:
$E[Z] = \int_0^1 z f_Z(z)\,dz = 2\int_0^1 z(1 - z)\,dz = 2\left(\frac{1}{2} - \frac{1}{3}\right) = \frac{1}{3}$
Therefore, the expected value for the minimum of X and Y is 1/3.

Solution #6.19
Say we flip the unfair coin n times. Each flip is a Bernoulli trial with a success probability of p:
$x_1, x_2, \dots, x_n, \quad x_i \sim \mathrm{Ber}(p)$
We can construct a confidence interval for p as follows, using the Central Limit Theorem. First, we
decide on our level of confidence. If we select a 95% confidence level, the necessary z-score is z = 1.96.
We then construct a 95% confidence interval for p. If it does not include 0.5, then
we can reject the null hypothesis that the coin is fair.
Since the trials are i.i.d., we can compute the sample proportion from a large number of trials:
$\hat{p} = \frac{1}{n}\sum_{i=1}^{n} x_i$
We know the following properties hold: $E[\hat{p}] = \frac{np}{n} = p$ and $\mathrm{Var}(\hat{p}) = \frac{np(1 - p)}{n^2} = \frac{p(1 - p)}{n}$

Therefore, our 95% confidence interval is given by the following:
$\hat{p} \pm z\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$
Since the true p = 0.6, plugging that in and setting the lower bound of the interval equal to 0.5 yields:
$0.6 - 1.96\sqrt{\frac{0.6(1 - 0.6)}{n}} = 0.5$
Solving for n yields approximately 93 flips.

Solution #6.20
Let the following be an indicator random variable: Xi = 1 if i is drawn in n turns
We would then want to find the following:
$E\left[\sum_{i=1}^{n} X_i\right] = \sum_{i=1}^{n} E[X_i]$
We know that $P(X_i = 1) = 1 - P(X_i = 0)$, so the probability of a particular number not being drawn (where each
draw is independent) is the following:
$P(X_i = 0) = \left(\frac{n-1}{n}\right)^n$
Therefore, we have $P(X_i = 1) = 1 - \left(\frac{n-1}{n}\right)^n$, and by linearity of expectation, we then have:
$\sum_{i=1}^{n} E[X_i] = n\,E[X_i] = n\left(1 - \left(\frac{n-1}{n}\right)^n\right)$

Solution #6.21
Say that we have n noodles. At any given step, we will have one of two outcomes: (1) we pick two
ends from the same noodle (which makes a loop), or (2) we pick two ends from different noodles.
Let Xn denote a random variable representing the number of loops with n noodles remaining.
The probability of case (1) happening is:
$\frac{n}{\binom{2n}{2}} = \frac{n}{n(2n-1)} = \frac{1}{2n-1}$
where the denominator represents the number of ways to choose two ends from the noodles, and the
numerator represents the number of ways to choose both ends of the same noodle.
Therefore, the probability of case (2) happening is: $1 - \frac{1}{2n-1} = \frac{2n-2}{2n-1}$
Taking cases (1) and (2) together (either way, the number of noodles drops by one), we have the following
recursive formulation for the expectation of the number of loops formed:
$E[X_n] = \frac{1}{2n-1}\left(1 + E[X_{n-1}]\right) + \frac{2n-2}{2n-1}E[X_{n-1}] = \frac{1}{2n-1} + E[X_{n-1}]$
Plugging in $E[X_1] = 1$ and unrolling the recursion, we can notice the following pattern, for
which we can plug in n = 100 to obtain the answer:
$E[X_{100}] = 1 + \frac{1}{3} + \frac{1}{5} + \dots + \frac{1}{2(100)-1} \approx 3.3$
Solution #6.22
Since we only have two dice, let
$X_1, X_2, \quad Y = \max(X_1, X_2)$
denote the first roll, the second roll, and the max of the two. Then we want to find the following:
$E[Y] = \sum_{i=1}^{6} i \cdot P(Y = i)$
We can condition Y = i on three cases: (1) die one is the strict max; (2) die two is the strict max; or (3)
both dice show the same value.


For cases (1) and (2) we have:
$P(X_1 = i, X_2 < i) = P(X_2 = i, X_1 < i) = \frac{1}{6}\cdot\frac{i-1}{6}$
For case (3), where both dice are the maximum:
$P(X_1 = X_2 = i) = \frac{1}{6}\cdot\frac{1}{6} = \frac{1}{36}$
Putting everything together yields the following:
$E[Y] = \sum_{i=1}^{6} i\left(\frac{i-1}{36}\cdot 2 + \frac{1}{36}\right) = \frac{161}{36}$
A simpler way to visualize this is to use a contingency table, such as the one below:
1 2 3 4 5 6
(1, 1) (1, 2) (1, 3) (1, 4) (1, 5) (1, 6)
(2, 1) (2, 2) (2, 3) (2, 4) (2, 5) (2, 6)
(3, 1) (3, 2) (3, 3) (3, 4) (3, 5) (3, 6)
(4, 1) (4, 2) (4, 3) (4, 4) (4, 5) (4, 6)
(5, 1) (5, 2) (5, 3) (5, 4) (5, 5) (5, 6)
(6, 1) (6, 2) (6, 3) (6, 4) (6, 5) (6, 6)

Then the expectation is simply given by:
$E[Y] = 1\times\frac{1}{36} + 2\times\frac{3}{36} + 3\times\frac{5}{36} + 4\times\frac{7}{36} + 5\times\frac{9}{36} + 6\times\frac{11}{36} = \frac{161}{36} \approx 4.5$

Solution #6.23
For $X \sim U(a, b)$, we have the following density: $f_X(x) = \frac{1}{b-a}$
Therefore, we can calculate the mean as:
$E[X] = \int_a^b x f_X(x)\,dx = \int_a^b \frac{x}{b-a}\,dx = \left.\frac{x^2}{2(b-a)}\right|_a^b = \frac{a+b}{2}$
Similarly, the variance can be expressed as: $\mathrm{Var}(X) = E[X^2] - E[X]^2$
Giving us:
$E[X^2] = \int_a^b x^2 f_X(x)\,dx = \int_a^b \frac{x^2}{b-a}\,dx = \left.\frac{x^3}{3(b-a)}\right|_a^b = \frac{a^2 + ab + b^2}{3}$
Therefore:
$\mathrm{Var}(X) = \frac{a^2 + ab + b^2}{3} - \left(\frac{a+b}{2}\right)^2 = \frac{(b-a)^2}{12}$

Solution #6.24
Although one can enumerate all the probabilities, this can get a bit messy from an algebraic
standpoint, so the following intuitive approach is preferable. Imagine we have aces A1,
A2, A3, A4. We can then draw a line between them to represent an arbitrary number (including 0)
of cards between each ace, with a line before the first ace and after the last.


|A1|A2|A3|A4|
There are 52 − 4 = 48 non-ace cards in a deck. Each of these cards is equally likely to land in any of the
five gaps. Therefore, we expect 48/5 = 9.6 cards to be drawn prior to the first ace.
Hence, the expected number of cards drawn until the first ace is seen is 9.6 + 1 = 10.6 cards — we
can't forget to add 1, because we need to include drawing the ace card itself.

Solution #6.25
Note that, for a uniform distribution, the probability density is $\frac{1}{b-a}$ for any value on the interval [a, b].
The likelihood function is therefore as follows:
$f(x_1, \dots, x_n \mid a, b) = \left(\frac{1}{b-a}\right)^n$
To obtain the MLE, we maximize this likelihood function, which is clearly maximized by making b − a as small as
possible while keeping every sample inside [a, b]; that is, b is the largest of the samples and a is the smallest.
Therefore, we have the following:
$\hat{a} = \min(x_1, \dots, x_n), \qquad \hat{b} = \max(x_1, \dots, x_n)$

Solution #6.26
Assume that we have an indicator random variable: $X_i = 1$ if the sequence is still monotonically increasing up to
the ith draw, and otherwise $X_i = 0$.
Then, we calculate the expectation $E[X_1 + X_2 + \dots]$. Consider some arbitrary i. In order to draw up to
element i, the first i draws must appear in increasing order. Since all $i!$ orderings of the first i draws are
equally likely, there is a $\frac{1}{i!}$ chance of $X_i$ being 1. By linearity of expectation, we then have:
$E[X_1 + X_2 + \dots] = 1 + \frac{1}{2!} + \frac{1}{3!} + \dots = e - 1$
Solution #6.27
One method of solving this problem is the brute-force method, which consists of computing the
expected values by listing all of the outcomes and their associated probabilities and payoffs. However,
there is an easier way of solving the problem.
Assume that the outcome of the roll of a die is given by a random variable X (meaning that it takes
on the values 1...6 with equal probability). Then the question is equivalent to asking: what is E[X] · E[X] = E[X]²
(i.e., the expected value of the product of two separate rolls) versus E[X²] (the expected value of the square of a
single roll)?
Recall that the variance of a given random variable X is as follows:
$\mathrm{Var}(X) = E[(X - E[X])^2] = E[X^2] - 2E[X]E[X] + E[X]^2 = E[X^2] - E[X]^2$
Notice that this variance term is exactly the difference between the expected payoffs of the two games (the
payoff of the second game minus the payoff of the first game). Since the left-hand side is positive (X is not
constant), the right-hand side is also positive. Therefore, it must be the case that the second game has a higher
expected value than the first.


Solution #6.28
In both cases, we are dealing with an estimator of the true parameter value. An estimator is
unbiased if the expectation of the estimator is the true underlying parameter value. An estimator is
consistent if, as the sample size increases, the estimator's sampling distribution converges towards
the true parameter value.
Consider the following random variable X, which is normally distributed, and n i.i.d. samples used to
calculate a sample mean:
$X \sim N(\mu, \sigma^2) \quad \text{and} \quad \bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n}$
The first sample alone is an example of an unbiased but not consistent estimator of the mean. It is unbiased since
$E[x_1] = \mu$. However, it is not consistent since, as the sample size increases, the sampling distribution
of the first sample does not become more concentrated around the true mean.
An example of a biased but consistent estimator is the (uncorrected) sample variance:
$S_n^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$
It can be shown that
$E[S_n^2] = \frac{n-1}{n}\sigma^2$
The formal fix for the above is called Bessel's correction, but there is an intuitive way to grasp the
presence of the term preceding the variance. If we uniformly sample two numbers at random from
the series of numbers 1 to n, there is an n/n² = 1/n chance that the two are the same number,
meaning the sampled squared difference of the numbers will be zero. The sample variance will
therefore slightly underestimate the true variance. However, this bias goes to 0 as n approaches
infinity, since the term in front of the variance, (n − 1)/n, approaches 1. Therefore, the estimator is
consistent.

Solution #6.29
MLE stands for maximum likelihood estimation, and MAP for maximum a posteriori. Both are ways
of estimating variables in a probability distribution by producing a single estimate of that variable.
Assume that we have a likelihood function $P(X \mid \theta)$. Given n i.i.d. samples, the MLE is as follows:
$\theta_{MLE} = \arg\max_{\theta} P(X \mid \theta) = \arg\max_{\theta} \prod_{i=1}^{n} P(x_i \mid \theta)$
Since the product of many numbers all valued between 0 and 1 can be very small, it is more convenient to
maximize the log of the product above. This is an equivalent problem, since the
log function is monotonically increasing. Since the log of a product is equivalent to the sum of logs,
the MLE becomes the following:
$\theta_{MLE} = \arg\max_{\theta} \sum_{i=1}^{n} \log P(x_i \mid \theta)$
Relying on Bayes' rule, MAP uses the posterior $P(\theta \mid X)$, which is proportional to the likelihood multiplied
by a prior $P(\theta)$, i.e., $P(\theta \mid X) \propto P(X \mid \theta)P(\theta)$. The MAP for $\theta$ is thus the following:
$\theta_{MAP} = \arg\max_{\theta} P(X \mid \theta)P(\theta) = \arg\max_{\theta} \prod_{i=1}^{n} P(x_i \mid \theta)P(\theta)$
Employing the same math as used in calculating the MLE, the MAP becomes:


$\theta_{MAP} = \arg\max_{\theta} \sum_{i=1}^{n} \log P(x_i \mid \theta) + \log P(\theta)$
Therefore, the only difference between the MLE and MAP is the inclusion of the prior in MAP;
otherwise, the two are identical. Moreover, MLE can be seen as a special case of the MAP with a
uniform prior.

Solution #6.30
Assume we have Bernoulli trials, each with a success probability p. Altogether, they form a binomial
distribution: $x_1, x_2, \dots, x_n$, $X \sim B(n, p)$, where $x_i = 1$ means success and $x_i = 0$ means failure.
Assuming i.i.d. trials, we can compute the sample proportion $\hat{p}$ as follows:
$\hat{p} = \frac{1}{n}\sum_{i=1}^{n} x_i$
We know that if n is large enough, then the sample proportion approximately follows the normal distribution:
$\hat{p} \approx N\left(p, \frac{p(1-p)}{n}\right)$
where n must satisfy $np \geq 10$ and $n(1 - p) \geq 10$.
Therefore, the value $\hat{p}$ can be used to simulate a normal distribution. The sample size n need only be large
enough to satisfy the conditions above (at least n = 20 for p = 0.5), but it is recommended to use a significantly
larger n to get a better normal approximation.
Finally, to simulate the standard normal distribution, we normalize $\hat{p}$:
$\hat{p}_0 = \frac{\hat{p} - p}{\sqrt{p(1-p)/n}}$
At this point, we can derive the final formula for our normal random generator:
$x = \frac{\frac{1}{n}\sum_{i=1}^{n} x_i - p}{\sqrt{\frac{p(1-p)}{n}}}$
The previous expression can be simplified to the following:
$x = \frac{\sum_{i=1}^{n} x_i - np}{\sqrt{np(1-p)}}$
where x1,…,xn is the Bernoulli series we get from the given random generator, with probability of
success p.
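A small Python sketch of this construction, with NumPy's binomial sampler standing in for the given Bernoulli generator and a made-up p; the generated values should have mean ≈ 0 and standard deviation ≈ 1.

import numpy as np

rng = np.random.default_rng(7)
p = 0.5          # assumed success probability of the Bernoulli generator
n = 1000         # flips used per generated normal value

def standard_normal_from_bernoulli(n_values=10_000):
    """Generate approximately standard normal values by summing Bernoulli trials."""
    flips = rng.binomial(1, p, size=(n_values, n))   # stand-in for the Bernoulli generator
    return (flips.sum(axis=1) - n * p) / np.sqrt(n * p * (1 - p))

samples = standard_normal_from_bernoulli()
print(f"mean ~ {samples.mean():.3f}, std ~ {samples.std():.3f}")   # should be ~0 and ~1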

Solution #6.31
We are seeking the expected value of geometric random variable X as follows:

$E[X] = \sum_{k=1}^{\infty} k\, f_X(k)$
The expression above contains a summation instead of an integral since X is a discrete rather than
continuous random variable, and we know the probability mass function of the geometric
distribution is given by: $f_X(k) = (1-p)^{k-1}p$


Therefore, we obtain the expected value of X as follows:
$E[X] = \sum_{k=1}^{\infty} k(1-p)^{k-1}p$
Since p is constant with respect to k, we can factor it out:
$E[X] = p\sum_{k=1}^{\infty} k(1-p)^{k-1}$
Note that the term inside the summation can be rewritten as a sum of geometric tails:
$\sum_{k=1}^{\infty} k(1-p)^{k-1} = \sum_{k=1}^{\infty}(1-p)^{k-1} + \sum_{k=2}^{\infty}(1-p)^{k-1} + \dots$
This simplifies to the following:
$\sum_{k=1}^{\infty} k(1-p)^{k-1} = \frac{1}{p} + \frac{1-p}{p} + \frac{(1-p)^2}{p} + \dots = \frac{1}{p}\left(1 + (1-p) + (1-p)^2 + \dots\right) = \frac{1}{p^2}$
Plugging this back into the equation for the expected value of X yields the following:


$E[X] = p \cdot \frac{1}{p^2} = \frac{1}{p}$

Solution #6.32
We can define a new variable Y = F(X), and hence we want to find the CDF of Y (where y is between
0 and 1 by definition of a CDF): $F_Y(y) = P(Y \leq y)$
Substituting in for Y yields the following: $F_Y(y) = P(F(X) \leq y)$
Applying the inverse CDF on both sides yields the following:
$F_Y(y) = P(F^{-1}(F(X)) \leq F^{-1}(y)) = P(X \leq F^{-1}(y))$
Note that the last expression is simply the CDF of X evaluated at $F^{-1}(y)$: $P(X \leq F^{-1}(y)) = F(F^{-1}(y)) = y$
Therefore, we have $F_Y(y) = y$.
Since y falls between 0 and 1, Y's distribution is simply a uniform one from 0 to 1, i.e., U(0, 1).

Solution #6.33
A moment generating function is the following function for a given random variable:
$M_X(s) = E[e^{sX}]$
If X is continuous (as in the case of normal distributions), then the function becomes the following:
$M_X(s) = \int_{-\infty}^{\infty} e^{sx} f_X(x)\,dx$

Hence, the moment generating function is a function for a given value of s. It is useful for calculating
moments, since taking derivatives of the moment generating function and evaluating at s = 0 yields
the desired moment.

For a normal distribution, recall that:
$f_X(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$
First, taking the special case of the standard normal random variable, we have the following:
$f_X(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}x^2}$
Plugging this into the above MGF yields:
$M_X(s) = \int_{-\infty}^{\infty} e^{sx}\frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}x^2}\,dx = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-\frac{1}{2}x^2 + sx}\,dx$
Completing the square yields:
$M_X(s) = e^{\frac{s^2}{2}}\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-\frac{(x-s)^2}{2}}\,dx = e^{\frac{s^2}{2}}$

Note that the last step uses the fact that the expression within the integral is the PDF of a normally
distributed random variable with mean s and variance 1, and hence the integral evaluates to 1.
To solve for a general normal random variable, you can plug in $X = \sigma Y + \mu$, where Y is a standard normal
variable, to yield:
$M_X(s) = e^{s\mu} M_Y(s\sigma) = e^{(s^2\sigma^2/2) + s\mu}$

Solution #6.34
Denote the n i.i.d. draws as $x_1, x_2, \dots, x_n$, where, for any individual draw, we have the pdf:
$f_X(x_i) = \lambda e^{-\lambda x_i}$


Therefore, the likelihood of the data is given by the following:
$L(\lambda; x_1, \dots, x_n) = \prod_{i=1}^{n} f_X(x_i) = \lambda^n \exp\left(-\lambda\sum_{i=1}^{n} x_i\right)$
Taking the log of the equation above to obtain the log-likelihood results in the following:
$\log L(\lambda; x_1, \dots, x_n) = n\log(\lambda) - \lambda\sum_{i=1}^{n} x_i$
Taking the derivative with respect to $\lambda$ and setting the result to 0 yields:
$\frac{n}{\lambda} - \sum_{i=1}^{n} x_i = 0$
Therefore, the best estimate of $\lambda$ is given by:
$\hat{\lambda} = \frac{n}{\sum_{i=1}^{n} x_i}$

Solution #6.35
Define Y = log X. We then want to solve for $E[e^Y] = E[X]$.
Recall that a moment generating function has the following form: $M_Y(s) = E[e^{sY}]$
Therefore, we want the moment generating function for $Y \sim N(0, 1)$, which was derived in problem
6.33 and has the form: $M_Y(s) = e^{s^2/2}$
Therefore, evaluating at s = 1 (since we want $E[e^Y]$) gives $M_Y(1) = e^{1/2}$, which is the desired
answer.

Solution #6.36
Say that we have two groups with distinct sizes: n₁ = size of group 1 and n₂ = size of group 2.
Given the means of the two groups, μ₁ and μ₂, the blended mean can be found simply by taking a
weighted average:
$\mu = \frac{n_1\mu_1 + n_2\mu_2}{n_1 + n_2}$
We know that the blended standard deviation for the total dataset has the form:
$s = \sqrt{\frac{\sum_{i=1}^{n_1+n_2}(z_i - \mu)^2}{n_1 + n_2}}$
where the $z_i$ are the union of the points from both groups.
However, since we are not given the initial data points from the two groups, we have to rearrange
this formula using the given variances of these groups, $s_1^2$ and $s_2^2$, as follows:
$s = \sqrt{\frac{n_1 s_1^2 + n_2 s_2^2 + n_1(\mu_1 - \mu)^2 + n_2(\mu_2 - \mu)^2}{n_1 + n_2}}$
Applying the Bessel correction, the blended standard deviation for the two groups is as follows:
$s = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2 + n_1(\mu_1 - \mu)^2 + n_2(\mu_2 - \mu)^2}{n_1 + n_2 - 1}}$
To extend the definition above to K subsets, the blended mean is:
$\mu = \frac{\sum_{i=1}^{K} n_i\mu_i}{\sum_{i=1}^{K} n_i}$
And the blended standard deviation is:
$s_K = \sqrt{\frac{\sum_{i=1}^{K}\left[(n_i - 1)s_i^2 + n_i(\mu_i - \mu)^2\right]}{\sum_{i=1}^{K} n_i - 1}}$
where the $n_i$ are the sizes of the initial groups, and $\mu_i$ and $s_i$ are their respective means and standard deviations.
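A minimal Python sketch of the blended (Bessel-corrected) formulas above, checked against directly pooling two made-up groups:

import numpy as np

def blended_stats(sizes, means, stds):
    """Combine per-group means/standard deviations into overall statistics
    (Bessel-corrected, matching the formulas above)."""
    sizes, means, stds = map(np.asarray, (sizes, means, stds))
    n_total = sizes.sum()
    mu = (sizes * means).sum() / n_total
    numerator = ((sizes - 1) * stds**2 + sizes * (means - mu) ** 2).sum()
    s = np.sqrt(numerator / (n_total - 1))
    return mu, s

# Sanity check against directly pooling two made-up groups
rng = np.random.default_rng(3)
g1, g2 = rng.normal(0, 1, 400), rng.normal(5, 2, 600)
print(blended_stats([len(g1), len(g2)], [g1.mean(), g2.mean()],
                    [g1.std(ddof=1), g2.std(ddof=1)]))
print((np.concatenate([g1, g2]).mean(), np.concatenate([g1, g2]).std(ddof=1)))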

Solution #6.37
Independence is defined as follows: $P(X = x, Y = y) = P(X = x)P(Y = y)$ for all x, y. Equivalently,
we can use the following definitions: $P(X = x \mid Y = y) = P(X = x)$ and $P(Y = y \mid X = x) = P(Y = y)$.
When two random variables X and Y are uncorrelated, their covariance, which is calculated as
follows, is 0: Cov ( X , Y ) =E[ XY ]−E [ X ] E[Y ]
For an example of uncorrelated but not independent variables, let X take on values -1, 0, or 1 with
equal probability, and let Y = 1 if X = 0 and Y = 0 otherwise. Then we can verify that X and Y are
uncorrelated:
1 1 1
E ( XY ) = (−1 )( 0 )+ ( 0 ) ( 1 )+ ( 1 )( 0 )=0
3 3 3
And E[X] = 0, so the covariance between the two random variables is zero. However, it is clear that
the two are not independent, since we defined Y in such a way that it obviously depends on X: for
example, P(Y = 1 | X = 0) = 1, which does not equal P(Y = 1) = 1/3.

Solution #6.38
By definition of the covariance, we have:
$\mathrm{Cov}(X, Y) = \mathrm{Cov}(X, X^2) = E\left[(X - E[X])(X^2 - E[X^2])\right]$
Expanding the terms of the equation above yields:
$\mathrm{Cov}(X, Y) = E\left[X^3 - X E[X^2] - X^2 E[X] + E[X]E[X^2]\right]$
Using linearity of expectation, we obtain:
$\mathrm{Cov}(X, Y) = E[X^3] - E[X]E[X^2] - E[X^2]E[X] + E[X]E[X^2]$
Since the second and last terms cancel one another, we end up with the following:
$\mathrm{Cov}(X, Y) = E[X^3] - E[X^2]E[X]$
Here, we conclude that E[X] = 0 (based on the definition of X) and that $E[X^3] = 0$ by evaluating the
probability density function of X as follows:
$f_X(x) = \frac{1}{b - a} = \frac{1}{1 - (-1)} = \frac{1}{2}$
Since we are integrating from −1 to 1, we then have:
$E[X^3] = \int_{-1}^{1} x^3 f(x)\,dx = \int_{-1}^{1} \frac{x^3}{2}\,dx = 0$


Thus, the covariance between X and Y is 0.

Solution #6.39
This can be proved using the inverse-transform method, whereby we sample from a uniform
distribution and then simulate the points on the circle employing the inverse cumulative distribution
functions (i.e., inverse CDFs).
We can define a random point within the circle using a radius value and an angle (and obtain
the corresponding x, y values from polar coordinates). To sample a random radius, consider the
following. If we sample points at radius r, there are 2πr points to consider (i.e., the circumference of the circle
at that radius). Likewise, at radius 2r, there are 4πr points to consider.
Therefore, the probability density function of the radius is given by:
$f_R(r) = \frac{2r}{R^2}$
This follows from the CDF, which is given by the ratio of the areas of the two circles:
$F_R(r) = \frac{r^2}{R^2}$
Therefore, for inverse sampling, we set a uniform draw u equal to the CDF: $u = \frac{r^2}{R^2}$
This simplifies to $r = R\sqrt{u}$.
Therefore, we can sample $U \sim \mathrm{Uniform}(0, 1)$ and the corresponding radius will be $r = R\sqrt{U}$.
For the corresponding angle, we can sample θ uniformly from the range 0 to 2π, i.e., θ ∈ [0, 2π], and
then set x = r cos(θ), y = r sin(θ).
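A short Python sketch of this inverse-transform sampler, with a quick area-ratio sanity check; the radius R and point counts are arbitrary choices for illustration.

import numpy as np

rng = np.random.default_rng(11)

def sample_uniform_in_circle(n, R=1.0):
    """Sample n points uniformly at random inside a circle of radius R
    via the inverse-CDF trick described above."""
    r = R * np.sqrt(rng.uniform(0, 1, n))       # radius = R * sqrt(U)
    theta = rng.uniform(0, 2 * np.pi, n)        # angle uniform on [0, 2*pi)
    return r * np.cos(theta), r * np.sin(theta)

x, y = sample_uniform_in_circle(100_000, R=2.0)
# Quick check: the fraction of points inside radius 1 should be ~(1/2)^2 = 0.25
print(np.mean(x**2 + y**2 <= 1.0))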

Solution #6.40
Let us define $N_t$ as the smallest n such that $\sum_{i=1}^{n} U_i > t$, for any value t between 0 and 1. Then we
want to find $m(t) = E[N_t]$.
Consider the first draw. Assuming that its result is some value x, we then have two cases. The
first is that x > t, in which case $N_t = 1$. The second is that x < t, necessitating that we sample again,
yielding $N_t = 1 + N_{t-x}$.
Putting these two together, we have:
$m(t) = 1 + \int_0^t m(t - x)\,dx$
Employing the change of variables $u = t - x$, $du = -dx$, we then substitute and simplify to obtain:
$m(t) = 1 + \int_0^t m(u)\,du$

Differentiating both sides, we then obtain m'(t) = m(t).
Since m(0) = 1, we then have m(t) = e^t.
Since we actually need the expected number of samples when t = 1, we plug t = 1 into the equation, which yields the
desired result m(1) = e.
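A quick Monte Carlo sketch in Python that should land near e ≈ 2.718 (a sanity check only):

import numpy as np

rng = np.random.default_rng(5)

def samples_until_sum_exceeds_one():
    """Draw Uniform(0, 1) values until their running sum exceeds 1."""
    total, count = 0.0, 0
    while total <= 1.0:
        total += rng.uniform(0, 1)
        count += 1
    return count

counts = [samples_until_sum_exceeds_one() for _ in range(100_000)]
print(f"Simulated expectation: {np.mean(counts):.4f} (theory: e ~ {np.e:.4f})")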



Machine Learning
CHAPTER 7

How much machine learning do you actually need to know to land a top job in Silicon Valley
or Wall Street? Probably less than you think! From coaching hundreds of data folks on the
job hunt, one of the most common misconceptions we saw was candidates thinking their
lack of deep learning expertise would tank their performance in data science interviews.
However, the truth is that most data scientists are hired to solve business problems — not
blindly throw complicated neural networks on top of dirty data. As such, a data scientist with
strong business intuition can create more business value by applying linear regression in an
Excel sheet than a script kiddie whose knowledge doesn't extend beyond the Keras API.
So unless you're interviewing for ML Engineering or research scientist roles, a solid
understanding of the classical machine learning techniques covered in this chapter is all you
need to ace the data science interview. However, if you are aiming for ML-heavy roles that
do require advanced knowledge, this chapter will still be handy! Throughout this chapter, we
frequently call attention to which topics and types of questions show up in tougher ML
interviews. Plus, the 35 questions at the end of the chapter — especially the hard ones —
will challenge even the most seasoned ML practitioner.



What to Expect for ML Interview Questions
When machine learning is brought up in an interview context, the problems fall into three major
buckets:

 Conceptual questions: Do you have a strong theoretical ML background?


 Resume-driven questions: Have you actively applied ML before?
 End-to-end modeling questions: Can you apply ML to a hypothetical business problem?
Conceptual Questions
Conceptual questions usually center around what different machine learning terms mean and how
popular machine learning techniques operate. For example, two frequently asked questions are
"What is the bias-variance trade-off?" and "How does PCA work?" To test your ability to
communicate with nontechnical stakeholders, a common twist on these conceptual questions is to
ask you to explain the answer as if they (the interviewer) were five years old (similar to Reddit's
popular r/ELI5 subreddit).
Because many data science roles don't require hardcore machine learning knowledge, easier,
straightforward questions such as these represent the vast majority of questions you'd expect during
a typical interview. Being asked easier ML questions is especially the case when interviewing for a
data science role that's more product and business analytics oriented, as having to build models
simply isn't part of the day-to-day work.
For ML-intensive positions like ML Engineer or Research Scientist, interviews also start with high-
level easier conceptual questions but then push you to dive deeper into the details via follow-up
questions. Companies do this to make sure you aren't a walking, talking ML buzzword generator. For
example, as a follow-up to defining the bias-variance trade-off, you might be asked to whiteboard
the math behind the concept. Instead of simply asking you how PCA (principal components analysis)
works, you might also be asked about the most common pitfalls of using PCA.
Since ML interviews are so expansive in scope, if asked about a particular technique you may not be
overly familiar with, it's perfectly okay to say, "I've read about it in the past. I don't have any
experience with these types of techniques, but I am interested to learn more about them!"
This signals honesty and an eagerness to learn (and don't be ashamed to admit not knowing
something — nobody knows all the techniques in detail! Trust us, it's better than pretending you
know the techniques and then falling apart when questions are asked).
If nothing on your resume seems interesting to an interviewer, but they still want to go deep into
one ML topic, they'll have you pick the topic. They do this by either asking "What's your favorite ML
algorithm?" or "What's a model you use often and why?" Consequently, it pays to have a deep
understanding of at least a single technique — something you've actually used before and that is
listed on your resume.
Word of caution: don't choose something like a state-of-the-art transformer model to discuss as
your favorite technique. Your details on it may be hazy, and your interviewer might not know enough
about it to carry on a good conversation. You are better off picking something fundamental yet
interesting (to you) so that you and your interviewer can have a meaningful discussion. For example,
our answer happens to be that we both like random forests because they can handle classification or
regression tasks with all kinds of input features with minimal preprocessing needed. Additionally, we
both have projects on our resume to back up our interest in random forests.



Resume-Driven Questions
The next most common type of interview question for ML interviews is the resume-driven question.
Resume-driven questions are often about showcasing that you have practical experience (as
opposed


to conceptual knowledge). As such, if you have job experience that is directly relevant, interviewers
will often ask about that. If not, they'll often fall back to asking about your projects.
While anything listed on your resume is fair game to be picked apart, this is especially true for more
ML-heavy roles. Because the field is so vast and continually evolving, an interviewer isn't able to
assess your fit for the job by asking about some niche topic unrelated to the position at hand. For
example, say you are going for a general data science role: it's not fair to ask a candidate about CNNs
and their use in computer vision if they have no experience with this topic and it's not relevant to
the job. But, suppose you hacked together a self-driving toy car last summer, and listed it on your
resume. In that case — even though the role at hand may not require computer vision — it's totally
fair game to be asked more about the neural network architecture you used, model training issues
you faced, and trade-offs you made versus other techniques. Plus, in an effort to see if you know the
details not just of your project, but of the greater landscape, you'd also be expected to answer
questions tangentially related to the project.

End-to-End Modeling Questions


Finally, the last type of ML-related problem you can expect during interviews is the end-to-end modeling
question. Interviewers are testing your ability to go beyond the ML theory covered in books like An
Introduction to Statistical Learning and actually apply what you learned to solve real-world
problems. Examples of questions include "How would you match Uber drivers to riders?" and "How
would you build a search autocomplete feature for Pinterest?" While these open-ended problems
are an interview staple for any machine-learning-heavy role, they do also pop up during generalist
data science interviews.
At the end of this chapter, we cover the end-to-end machine learning workflow, which can serve as a
framework for answering these broad ML questions. We cover steps like problem definition, feature
engineering, and performance metric selection — things you'd do before jumping into the various
ML techniques we soon cover. To better solve these ML case study problems, we also recommend
reading Chapter 11: Case Study to understand the non-ML-specific advice we offer for tackling open-
ended problems.

The Math Behind Machine Learning


While the probability and statistics concepts upon which machine learning's foundation is built are
fair game for interviews, you're less likely to be asked about the linear algebra and multivariable
calculus concepts that underlie machine learning. There are, however, two notable exceptions: if
you’re interviewing for a research scientist position or for quant finance. In these cases, you may be
expected to whiteboard proofs and derive formulas. For example, you could be asked to derive the
least squares estimator in linear regression or explain how to calculate the principal components in
PCA. Sometimes, to see how strong your first principles are, you'll be given a math problem more
indirectly. For instance, you could be asked to analyze the statistical factors driving portfolio returns
(which essentially boils down to explaining the math behind PCA). Regardless of the role and
company, we still recommend you review the basics, since understanding them will help you grok
the theoretical underpinnings of the techniques covered later in this chapter.

Linear Algebra
The main linear algebra subtopic worth touching on for interviews is eigenvalues and eigenvectors.
Mechanically, for some n × n matrix A, x is an eigenvector of A if Ax = λx, where λ is a scalar. A matrix can represent a linear transformation; when that transformation is applied to an eigenvector x, the resulting vector has the same direction as x and is in fact x multiplied by a scaling factor λ, known as an eigenvalue.
The decomposition of a square matrix into its eigenvectors is called an eigendecomposition.
However, not all matrices are square. Non-square matrices are decomposed using a method called
singular value decomposition (SVD). A matrix to which SVD is applied has a decomposition of the
form: A = UΣVᵀ, where U is an m × m matrix, Σ is an m × n matrix, and V is an n × n matrix.
There are many applications of linear algebra in ML, ranging from the matrix multiplications during
backpropagation in neural networks, to using eigendecomposition of a covariance matrix in PCA. As
such, during technical interviews for ML engineering and quantitative finance roles, you should be
able to whiteboard any follow-up questions on the linear algebra concepts underlying techniques
like PCA and linear regression. Other linear algebra topics you're expected to know are core building
blocks like vector spaces, projections, inverses, matrix transformations, determinants,
orthonormality, and diagonalization.

Gradient Descent
Machine learning is concerned with minimizing some particular objective function (most commonly
known as a loss or cost function). A loss function measures how well a particular model fits a given
dataset, and the lower the cost, the more desirable. Techniques to optimize the loss function are
known as optimization methods.
One popular optimization method is gradient descent, which takes small steps in the direction of
steepest descent for a particular objective function. It's akin to racing down a hill. To win, you always
take a “next step” in the steepest direction downhill.

[Figure: gradient descent taking steps down a convex cost curve toward the minimum cost as the weight is updated]
For convex functions, the gradient descent algorithm eventually finds the optimal point by updating
the below equation until the value at the next iteration is very close to the current iteration
(convergence):
x_{i+1} = x_i − α_i ∇f(x_i)
That is, it calculates the negative of the gradient of the cost function, scales it by some constant α_i, which is known as the learning rate, and then moves in that direction at each iteration of the algorithm.
Since many cost functions in machine learning can be broken down into the sum of individual functions, the gradient step can be broken down into adding separate gradients. However, this process can be computationally expensive, and the algorithm may get stuck at a local minimum or saddle point. Therefore, we can use a version of gradient descent called stochastic gradient descent (SGD), which adds an element of randomness so that the gradient does not get stuck. SGD uses a single data point at a time for each step, yet is nonetheless able to obtain an unbiased estimate of the true gradient. Alternatively, we can use mini-batch gradient descent, which uses a fixed, small number of data points (a mini-batch) per step.

Gradient descent and SGD are popular topics for ML interviews since they are used to optimize the
training of almost all machine learning methods. Besides the usual questions on the high-level
concepts and mathematical details, you may be asked when you would want to use one or the other.
You might even be asked to implement a basic version of SGD in a coding interview (which we cover
in Chapter 9, problem #30).
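To make the update rule concrete, here is a minimal NumPy sketch of full-batch gradient descent and SGD for linear regression with a mean-squared-error loss. The synthetic data, learning rates, and iteration counts are illustrative assumptions, not a prescription.

import numpy as np

def gradient_descent(X, y, lr=0.01, n_iters=1000):
    """Full-batch gradient descent for linear regression (MSE loss)."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        grad = (2 / n) * X.T @ (X @ w - y)   # gradient of the MSE loss w.r.t. w
        w -= lr * grad                        # step in the direction of steepest descent
    return w

def sgd(X, y, lr=0.01, n_epochs=50):
    """Stochastic gradient descent: one randomly chosen point per update."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_epochs):
        for i in np.random.permutation(n):
            xi, yi = X[i], y[i]
            grad = 2 * xi * (xi @ w - yi)     # noisy but unbiased estimate of the full gradient
            w -= lr * grad
    return w

# Toy usage on synthetic data (purely illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=200)
print(gradient_descent(X, y))   # both should land close to true_w
print(sgd(X, y))

Note how each SGD update relies on a single point's gradient, which is noisy but cheap, trading per-step accuracy for many more steps per pass over the data.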

Model Evaluation and Selection


With the math underlying machine learning techniques out of the way, how do we actually choose
the best model for our problem, or compare two models against each other? Model evaluation is the
process of evaluating how well a model performs on the test set after it's been trained on the train
set. Separating out your training data — usually 80% for the train set — from the 20% used as the test set is critical, because the usefulness of a model boils down to how good its predictions are on data that has not been seen before.
Model selection, as the name implies, is the process of selecting which model to implement after
each model has been evaluated. Both steps (evaluation and selection) are critical to get right,
because even tiny changes in model performance can lead to massive gains at big tech companies.
For example, at Facebook, a model that can cause even a 0.1% lift in ad click-through rates can lead
to $10+ million in extra revenue.
That's why in interviews, especially during case-study questions where you solve an open-ended
problem, discussions often head toward comparing and contrasting models, and selecting the most
suitable one after factoring in business and product constraints. Thus, internalizing the concepts
covered in this section is key to succeeding in ML interviews.

Bias-Variance Trade-off
The bias-variance trade-off is an interview classic, and is a key framework for understanding different
kinds of models. With any model, we are usually trying to estimate a function f(x), which predicts our
target variable y based on our input x. This relationship can be described as follows:
y = f(x) + w
where w is noise, not captured by f(x), and is assumed to be distributed as a zero-mean Gaussian
random variable for certain regression problems. To assess how well the model fits, we can
decompose the error of y into the following:

1. Bias: how far the model's predicted values are, on average, from the true underlying f(x) values, with smaller being better
2. Variance: the extent to which model prediction error changes based on training inputs, with
smaller being better

3. Irreducible error: variation due to inherently noisy observation processes

The trade-off between bias and variance provides a lens through which you can analyze different
models. Say we want to predict housing prices given a large set of potential predictors (square
footage of a house, the number of bathrooms, and so on). A model with high bias but low variance,
such as linear regression, is easy to implement but may oversimplify the situation at hand. This high
bias but low variance situation would mean that predicted house prices are frequently off from the
market value, but the variance in these predicted prices is low. On the flip side, a model with low
bias and high variance, such as neural networks, would lead to predicted house prices closer to
market value, but with predictions varying wildly based on the input features.

[Figure: 2 x 2 grid illustrating the four combinations of low/high bias and low/high variance]

While the bias-variance trade-off equation occasionally shows up in data science interviews, more
frequently, you'll be asked to reason about the bias-variance trade-off given a specific situation. For
example, presented with a model that has high variance, you could mention how you'd source
additional data to fix the issue. Posed with a situation where the model has high bias, you could
discuss how increasing the complexity of the model could help. By understanding the business and
product requirements, you'll know how to make the bias-variance trade-off for the interview
problem posed.

Model Complexity and Overfitting


"All models are wrong, but some are useful" is a well-known adage, coined by statistician George
Box. Ultimately, our goal is to discover a model that can generalize to learn some relationship within
datasets. Occam's razor, applied to machine learning, suggests that simpler models are generally
more useful and correct than more complicated models. That's because simpler, more parsimonious
models tend to generalize better.
Said another way, simpler, smaller models are less likely to overfit (fit too closely to the training
data). Overfit models tend not to generalize well out of sample. That's because during overfitting,
the models pick up too much noise or random fluctuations using the training data, which hinders
performance on data the model has never seen before.


[Figure: overfitting (high variance), underfitting (high bias), and a good balance (low bias, low variance)]

Underfitting refers to the opposite case — the scenario where the model is not learning enough of
the true relationship underlying the data. Because overfitting is so common in real-world machine
learning, interviewers commonly ask you how you can detect it, and what you can do to avoid it,
which brings us to our next topic: regularization.

Regularization
Regularization aims to reduce the complexity of models. In relation to the bias-variance trade-off,
regularization aims to decrease complexity in a way that significantly reduces variance while only
slightly increasing bias. The most widely used forms of regularization are L1 and L2. Both methods
add a simple penalty term to the objective function. The penalty helps shrink the coefficients of features, which reduces overfitting. This is why, not surprisingly, they are also known as shrinkage methods. Specifically, L1, also known as lasso, adds the absolute value of each coefficient to the objective function as a penalty. On the other hand, L2, also known as ridge, adds the squared magnitude of each coefficient to the objective function. The L1 and L2 penalties can also be linearly combined, resulting in the popular form of regularization called elastic net. Since having models overfit is a prevalent problem in machine learning, it's important to understand when to use each type of regularization. For example, L1 serves as a feature selection method, since many coefficients shrink to 0 (are zeroed out), and hence, are removed from the model. L2 is less likely to shrink any coefficients to 0. Therefore, L1 regularization leads to sparser models, and is thus considered a stricter shrinkage operation.
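As a rough sketch of how this looks in practice, scikit-learn exposes all three penalties directly; the synthetic data and alpha values below are arbitrary assumptions and would normally be tuned.

import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)  # only two informative features

lasso = Lasso(alpha=0.1).fit(X, y)                      # L1: many coefficients shrink exactly to 0
ridge = Ridge(alpha=1.0).fit(X, y)                      # L2: coefficients shrink but rarely hit 0
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)    # linear combination of L1 and L2

print("lasso nonzero coefficients:", np.sum(lasso.coef_ != 0))
print("ridge nonzero coefficients:", np.sum(ridge.coef_ != 0))

Comparing the counts of nonzero coefficients makes the "L1 leads to sparser models" point visible in a single run.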

Interpretability & Explainability
In Kaggle competitions and classwork, you might be expected to maximize a model performance metric like accuracy. However, in the real world, rather than just maximizing a particular metric, you might also be responsible for explaining how your model came up with that output. For example, if your model predicts that someone shouldn't get a loan, doesn't that person deserve to know why? More broadly, interpretable models can help you identify biases in the model, which leads to more ethical AI. Plus, in some fields, like healthcare, there can be deep auditing on decisions, and explainable models can help you stay compliant. However, there's usually a trade-off between performance and model interpretability. Often, using a more complex model might increase performance, but make results harder to interpret.
Various models have their own way of interpreting feature importance. For example, linear models have weights which can be visualized and analyzed to interpret the decision making. Similarly, random forests have feature importance readily available to identify what the model is using and learning. There are also some general frameworks that can help with more "black-box" models. One is SHAP (SHapley Additive exPlanations), which uses Shapley values to denote the average marginal contribution of a feature over all possible combinations of inputs. Another technique is LIME (Local Interpretable Model-agnostic Explanations), which uses sparse linear models built around various predictions to understand how any model performs in that local vicinity.
While it's rare to be asked about the details of SHAP and LIME during interviews, having a basic
understanding of why model interpretability matters, and bringing up this consideration in more
open-ended problems is key.
Model Training
We've covered frameworks to evaluate models, and selected the best-performing ones, but how do
we actually train the model in the first place? If you don't master the art of model training (aka
teaching machines to learn), even the best machine learning techniques will fail. Recall the basics:
we first train models on a training dataset and then test the models on a testing dataset. Normally,
80% of the data will go towards training data, and 20% serves as the test set. But as we soon cover,
there's much more to model training than the 80/20 train vs. test split.
Cross-Validation
Cross-validation assesses the performance of an algorithm in several subsamples of training data. It
consists of running the algorithm on subsamples of the training data, such as the original data
without some of the original observations, and evaluating model performance on the portion of the
data that was excluded from the subsample. This process is repeated many times for the different
subsamples, and the results are combined at the end.
Cross-validation helps you avoid training and testing on the same subsets of data points, which
would lead to overfitting. As mentioned earlier, in cases where there isn't enough data or getting
more data is costly, cross-validation enables you to have more faith in the quality and consistency of
a model's test performance. Because of this, questions about how cross-validation works and when
to use it are routinely asked in data science interviews.
One popular way to do cross-validation is called k-fold cross-validation. The process is as follows:
1. Randomly shuffle the data into k equally sized blocks (folds).
2. For each fold i, train the model on all the data except for fold i, and evaluate the validation error using fold i.
3. Average the k validation errors from step 2 to get an estimate of the true error.
Dataset (split into 5 folds):
Estimation 1:  Test 1 | Train  | Train  | Train  | Train
Estimation 2:  Train  | Test 2 | Train  | Train  | Train
Estimation 3:  Train  | Train  | Test 3 | Train  | Train
Estimation 4:  Train  | Train  | Train  | Test 4 | Train
Estimation 5:  Train  | Train  | Train  | Train  | Test 5
Example of 5-Fold Cross-Validation

Another form of cross-validation you're expected to know for the interview is leave-one-out cross-
validation. LOOCV is a special case of k-fold cross-validation where k is equal to the size of the
dataset (n). That is, the model is tested on every single data point during the cross-validation.
In the case of larger datasets, cross-validation can become computationally expensive, because a model must be trained and evaluated for every fold. In this case, it can be better to use a train-validation split, where you split the data into three parts: a training set, a dedicated validation set (also known as a "dev" set), and a test set. The validation set usually ranges from 10%-20% of the entire dataset.
An interview question that comes up from time to time is how to apply cross-validation for
time-series data. Standard k-fold CV can't be applied, since the time-series data is not randomly
distributed but instead is already in chronological order. Therefore, you should not be using data "in
the future" for predicting data "from the past." Instead, you should use historical data up until a
given point in time, and vary that point in time from the beginning till the end.
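As a hedged illustration, scikit-learn covers both the standard and the chronological case; the logistic regression model and synthetic data below are stand-ins, not recommendations.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, TimeSeriesSplit, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)

model = LogisticRegression()

# Standard 5-fold CV: shuffle, split into 5 folds, hold each fold out once
kfold_scores = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
print("k-fold accuracy:", kfold_scores.mean())

# Time-series CV: each split trains on "the past" and validates on "the future"
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    model.fit(X[train_idx], y[train_idx])
    print("fold test accuracy:", model.score(X[test_idx], y[test_idx]))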

Bootstrapping and Bagging


The process of bootstrapping is simply drawing observations from a large data sample repeatedly
(sampling with replacement) and estimating some quantity of a population by averaging estimates
from multiple smaller samples. Besides being useful in cases where the dataset is small,
bootstrapping is also useful for helping deal with class imbalance: for the classes that are rare, we
can generate new samples via bootstrapping.
Another common application of bootstrapping is in ensemble learning: the process of averaging
estimates from many smaller models into a main model. Each individual model is produced using a
particular sample from the process. This process of bootstrap aggregation is also known as bagging.
Later in this chapter, we'll show concrete examples of how random forests utilize bootstrapping and
bagging.
Ensemble methods like random forests, AdaBoost and XGBoost are industry favorites, and as such,
interviewers tend to ask questions about bootstrapping and ensemble learning. For example, one of
the most common interview questions is: "What is the difference between XGBoost and a random
forest?"
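Here is a minimal NumPy sketch of bootstrapping, assuming we want an estimate and a rough confidence interval for a sample mean; the same sampling-with-replacement idea is what bagging applies to model training.

import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=300)   # illustrative sample

boot_means = []
for _ in range(2000):
    sample = rng.choice(data, size=len(data), replace=True)  # draw WITH replacement
    boot_means.append(sample.mean())

boot_means = np.array(boot_means)
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"bootstrap estimate of the mean: {boot_means.mean():.3f}")
print(f"95% bootstrap interval: ({lower:.3f}, {upper:.3f})")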
Hyperparameter Tuning
Hyperparameters are important because they impact a model's training time, compute resources
needed (and hence cost), and, ultimately, performance. One popular method for tuning
hyperparameters is grid search, which involves forming a grid that is the Cartesian product of those
parameters and then sequentially trying all such combinations and seeing which yields the best
results. While comprehensive, this method can take a long time to run since the cost increases
exponentially with the number of hyperparameters. Another popular hyperparameter tuning
method is random search, where we define a distribution for each parameter and randomly sample
from the joint distribution over all parameters. This solves the problem of exploring an exponentially
increasing search space, but is not necessarily guaranteed to achieve an optimal result.
While not generally asked about in generalist data science interviews, for research scientist or machine learning engineering roles, hyperparameter tuning techniques such as the methods mentioned earlier, along with Bayesian hyperparameter optimization, might be brought up. This discussion mostly happens in
the context of neural networks, random forests, or XGBoost. For interviews, you should be able to list
a couple of the hyperparameters for your favorite modeling technique, along with what impacts they
have on generalization.
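A sketch of both tuning strategies with scikit-learn, using a random forest and an arbitrary toy search space; the parameter ranges here are illustrative assumptions, not recommendations.

import numpy as np
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Grid search: exhaustively try the Cartesian product of candidate values
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [3, 5, None]},
    cv=5,
)
grid.fit(X, y)
print("grid search best params:", grid.best_params_)

# Random search: sample parameter combinations from distributions
rand = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": randint(50, 300), "max_depth": randint(2, 10)},
    n_iter=10,
    cv=5,
    random_state=0,
)
rand.fit(X, y)
print("random search best params:", rand.best_params_)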

Training Times and Learning Curves


Training time is another factor to consider when it comes to model selection, especially for exceedingly large datasets. As we explain later in the coding chapter, it's possible to use big-O notation to clarify
the theoretical bounds on training time for each algorithm. These training time estimates are based
on the number of data points and the dimensionality of the data.


When training ML models in real life, you should also factor in training time considerations and resource
constraints during model selection. While you can always train more complex models that might
achieve marginally higher model performance metrics, the trade-off versus increased resource usage
and training time might make such a decision suboptimal.
Learning curves are plots of model learning performance over time. The y-axis is some metric of
learning (for example, classification accuracy), and the x-axis is experience (time).

[Figure: learning curves plotting the training and validation curves against the number of iterations]
A popular data science interview question involving learning curves is "How would you identify if
your model was overfitting?" By analyzing the learning curves, you should be able to spot whether
the model is underfitting or overfitting. For example, above, you can see that as the number of
iterations is increasing, the training error is getting better. However, the validation error is not
improving — in fact, it is increasing at the end — a clear sign that the model is overfitting and
training can be stopped. Additionally, learning curves should help you discover whether a dataset is
representative or not. If the data was not representative, the plot would show a large gap between
the training curve and validation curve, which doesn't get smaller even as training time increases.

Linear Regression
Linear regression is a form of supervised learning, where a model is trained on labeled input data.
Linear regression is one of the most popular methods employed in machine learning and has many
real-life applications due to its quick runtime and interpretability. That's why there's the joke about
regression to regression: where you try to solve a problem with more advanced methods but end up
falling back to tried and true linear regression.
As such, linear regression questions are asked in all types of data science and machine learning
interviews. Essentially, interviewers are trying to make sure your knowledge goes beyond importing
LinearRegression from scikit-learn and then blindly calling LinearRegression().fit(X, y). That's why deep
knowledge of linear regression — understanding its assumptions, addressing edge cases that come
up in real-life scenarios, and knowing the different evaluation metrics — will set you apart from
other candidates.


In linear regression, the goal is to estimate a function f(x), such that each feature has a linear
relationship to the target variable y, or:
y = Xβ
where X is a matrix of predictor variables and β is a vector of parameters that determines the weight
of each variable in predicting the target variable. So, how do you compare the performance of two
linear regression models?

Evaluating Linear Regression


Evaluation of a regression model is built on the concept of a residual: the distance between what the
model predicted versus the actual value. Linear regression estimates β by minimizing the residual sum of squares (RSS), which is given by the following:
RSS(β) = (y − Xβ)ᵀ(y − Xβ)
Two other sum of squares concepts to know besides the RSS are the total sum of squares (TSS) and explained sum of squares (ESS). The total sum of squares is the combined variation in the data (ESS + RSS). The explained sum of squares is the difference between TSS and RSS. R², a popular metric for assessing goodness-of-fit, is given by R² = 1 − RSS/TSS. It ranges between zero and one, and represents the proportion of variability in the data explained by the model. Other prominent error metrics to measure the goodness-of-fit of linear regression are MSE (mean squared error) and MAE (mean absolute error). MSE measures the variance of the residuals, whereas MAE measures the average magnitude of the residuals; hence, MSE penalizes larger errors more than MAE, making it more sensitive to outliers.

[Figure: actual vs. predicted values from a dummy dataset]


A common interview question is "What's the expected impact on R² when adding more features to a model?" While adding more features to a model always increases the R², that doesn't necessarily make for a better model. Since any machine learning model can overfit by having more parameters, a goodness-of-fit measure like R² should likely also be assessed with model complexity in mind. Metrics that take into account the number of features of linear regression models include AIC, BIC, Mallow's Cp, and adjusted R².
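For reference, the usual goodness-of-fit metrics are one-liners with scikit-learn, and adjusted R², which penalizes extra features, is a short formula on top of R²; the data and model below are purely illustrative.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

r2 = r2_score(y_test, y_pred)
n, p = X_test.shape
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)   # penalizes adding more features

print("R^2:", r2)
print("adjusted R^2:", adjusted_r2)
print("MSE:", mean_squared_error(y_test, y_pred))    # penalizes large errors more heavily
print("MAE:", mean_absolute_error(y_test, y_pred))   # less sensitive to outliers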

Subset Selection
So, how do you reduce model complexity of a regression model? Subset selection. By default, we use
all the predictors in a linear model. However, in practice, it's important to narrow down the number
of features, and only include the most important features. One way is best subset selection, which
tries each model with k predictors, out of p possible ones, where k < p. Then, you choose the best
subset model using a regression metric like R². While this guarantees the best result, it can be
computationally infeasible as p increases (due to the exponential number of combinations to try).
Additionally, by trying every option in a large search space, you're likely to get a model that overfits
with a high variance in coefficient estimates.
Therefore, an alternative is to use stepwise selection. In forward stepwise selection, we start with an
empty model and iteratively add the most useful predictor. In backward stepwise selection, we start
with the full model and iteratively remove the least useful predictor. While doing stepwise selection,
we aim to find a model with high R² and low RSS, while considering the number of predictors using metrics like AIC or adjusted R².

Linear Regression Assumptions


Because linear regression is one of the most commonly applied models, it has the honor of also
being one of the most misapplied models. Before you can use this technique, you must validate its
four main assumptions to prevent erroneous results:
 Linearity: The relationship between the feature set and the target variable is linear.
 Homoscedasticity: The variance of the residuals is constant.
 Independence: All observations are independent of one another.
 Normality: The distribution of Y is assumed to be normal.
These assumptions are crucial to know. For example, in the figure that follows, there are four datasets with identical lines of best fit. However, only in the top left dataset are these four assumptions met.


Note: for the independence and normality assumption, use of the term "i.i.d." (independent and
identically distributed) is also common. If any of these assumptions are violated, any forecasts or
confidence intervals based on the model will most likely be misleading or biased. As a result, the
linear regression model will likely perform poorly out of sample.

Avoiding Linear Regression Pitfalls


Heteroscedasticity
Linear regression assumes that the residuals (the distance between what the model predicted versus
the actual value) are identically distributed. If the variance of the residuals is not constant, then
heteroscedasticity is most likely present, meaning that the residuals are not identically distributed.
To find heteroscedasticity, you can plot the residuals versus the fitted values. If the relationship
between residuals and fitted values has a nonlinear pattern, this indicates that you should try to
transform the dependent variable or include nonlinear terms in the model.

Example of heteroscedasticity. As the x-value increases, the residuals increase too

Another useful diagnostic plot is the scale-location plot, which plots standardized residuals versus
the fitted values. If the data shows heteroscedasticity, then you will not see a horizontal line with
equally spread points.


[Figure: scale-location plots of standardized residuals versus fitted values for two cases]

Normality
Linear regression assumes the residuals are normally distributed. We can test this through a QQ plot.
Also known as a quantile plot, a QQ plot graphs the standardized residuals versus theoretical
quantiles and shows whether the residuals appear to be normally distributed (i.e., the plot
resembles a straight line). If the QQ plot is not a reasonably straight line, this is a sign that the
residuals are not normally distributed, and hence, the model should be reexamined. In that case,
transforming the dependent variable (with a log or square-root transformation, for example) can
help reduce skew.


Outliers
Outliers can have an outsized impact on regression results. There are several ways to identify
outliers. One of the more popular methods is examining Cook's distance, which is the estimate of the
influence of any given data point. Cook's distance takes into account the residual and leverage (how
far away the X value differs from that of other observations) of every point. In practice, it can be
useful to remove points with a Cook's distance value above a certain threshold.

Multicollinearity
Another pitfall is if the predictors are correlated. This phenomenon, known as multicollinearity,
affects the resulting coefficient estimates by making it problematic to distinguish the true underlying
individual weights of variables. Multicollinearity is most commonly observed through weights that flip in magnitude. It is one of the reasons why model weights cannot be directly interpreted as the importance of a feature in linear regression. Features that initially would appear to be independent variables can often be highly correlated: for example, the number of Instagram posts made and the number of notifications received are most likely highly correlated, since both are related to user activity on the platform, and one generally causes the other.
One way to assess multicollinearity is by examining the variance inflation factor (VIF), which
quantifies how much the estimated coefficients are inflated when multicollinearity exists. Methods
to address multicollinearity include removing the correlated variables, linearly combining the
variables, or using PCA/PLS (partial least squares).
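In practice, one way to compute VIFs is with statsmodels; the Instagram-style features below are simulated purely so that the correlated pair stands out, and the commonly quoted VIF thresholds of 5-10 are rules of thumb rather than hard cutoffs.

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
posts = rng.poisson(lam=10, size=500)
notifications = 3 * posts + rng.poisson(lam=2, size=500)   # deliberately correlated with posts
session_time = rng.normal(loc=30, scale=5, size=500)

X = pd.DataFrame({"posts": posts, "notifications": notifications, "session_time": session_time})
X_const = sm.add_constant(X)   # VIFs are usually computed with an intercept included

for i in range(1, X_const.shape[1]):   # skip the constant column itself
    print(X_const.columns[i], round(variance_inflation_factor(X_const.values, i), 2))
# The correlated pair (posts, notifications) should show far higher VIFs than session_time.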

Confounding Variables
Multicollinearity is an extreme case of confounding, which occurs when a variable (but not the main
independent or dependent variables) affects the relationship between the independent and
dependent variables. This can cause invalid correlations. For example, say you were studying the
effects of ice cream consumption on sunburns and find that higher ice cream consumption leads to a
higher likelihood of sunburn. That would be an incorrect conclusion because temperature is the
confounding variable — higher summer temperatures lead to people eating more ice cream and also
spending more time outdoors (which leads to more sunburn).
Confounding can occur in many other ways, too. For example, one way is selection bias, where the
data are biased due to the way they were collected (for example, group imbalance). Another
problem, known as omitted variable bias, occurs when important variables are omitted, resulting in
a linear regression model that is biased and inconsistent. Omitted variables can stem from dataset
generation issues or choices made during modeling. A common way to handle confounding is
stratification, a process where you create multiple categories or subgroups in which the confounding
variables do not vary much, and then test significance and strength of associations using chi square.

Knowing about these regression edge cases, how to identify them, and how to guard against them is
crucial. This knowledge separates the seasoned data scientists from the data neophyte — precisely
why it's such a popular topic for data science interviews.

Generalized Linear Models


In linear regression, the residuals are assumed to be normally distributed. The generalized linear
model (GLM) is a generalization of linear regression that allows for the residuals to not just be
normally distributed. For example, if Tinder wanted to predict the number of matches somebody
would get in a month, they would likely want to use a GLM like the one below with a Poisson
response (called Poisson regression) instead of a standard linear regression. The three common
components to any GLM are:

[Figure: the three components of a GLM for Poisson regression: a link function relates the systematic component (the linear predictor) to the mean μ_i of the random component, with y_i ~ Poisson(μ_i)]

 Random Component: is the distribution of the error term, e.g., the normal distribution for linear regression.

 Systematic Component: consists of the explanatory variables, i.e., the predictors combined in a
linear combination.
 Link function: is the link between the random and systematic components, e.g., the identity link for linear regression or the logit link for logistic regression.
Note that in GLMs, the response variable is still modeled through a linear combination of weights and predictors.
Regression can also use the weights and predictors nonlinearly; the most common examples of this
are polynomial regressions, splines, and general additive models. While interesting, these
techniques are rarely asked about in interviews and thus are beyond the scope of this book.
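As a rough sketch of the match-count example above, a Poisson GLM can be fit with statsmodels; the features and simulated data are made up purely for illustration.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1000
swipes = rng.poisson(lam=50, size=n)
profile_quality = rng.normal(size=n)

# Simulate match counts from a Poisson whose log-mean is linear in the features
mu = np.exp(0.01 * swipes + 0.3 * profile_quality)
matches = rng.poisson(mu)

X = sm.add_constant(np.column_stack([swipes, profile_quality]))
poisson_model = sm.GLM(matches, X, family=sm.families.Poisson()).fit()
print(poisson_model.params)   # recovered coefficients on the log (link) scale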

Classification
General Framework
Interview questions related to classification algorithms are commonly asked during interviews due to
the abundance of real-life applications for assigning categories to things. For example, classifying
users as likely to churn or not, predicting whether a person will click on an ad or not, and
distinguishing fraudulent transactions from legitimate ones are all applications of the classification
techniques we mention in this section.
The goal of classification is to assign a given data point to one of K possible classes instead of
calculating a continuous value (as in regression). The two types of classification models are
generative models and discriminative models. Generative models deal with the joint distribution of X
and Y, which is defined as follows:
p(X, Y) = p(Y|X) p(X)

Maximizing a posterior probability distribution produces decision boundaries between classes where
the resulting posterior probability is equivalent. The second type of model is discriminative. It
directly determines a decision boundary by choosing the class that maximizes the probability:
ŷ = argmax_k p(Y = k | x)
Thus, both methods choose a predicted class that maximizes the posterior probability distribution;
the difference is simply the approach. While traditional classification deals with just two classes (0 or
1), multi-class classification is common, and many of the below methods can be adapted to handle
multiple labels.

Evaluating Classifiers
Before we detail the various classification algorithms like logistic regression and Naive Bayes, it's
essential to understand how to evaluate the predictive power of a classification model.
Say you are trying to predict whether an individual has a rare cancer that only happens to 1 in
10,000 people. By default, you could simply predict that every person doesn't have cancer and be
accurate 99.99% of the time. But clearly, this isn't a helpful model — Pfizer won't be acquiring our
diagnostic test anytime soon! Given imbalanced classes, assessing accuracy alone is not enough —
this is known as the "accuracy paradox" and is the reason why it's critical to look at other measures
for misclassified observations.

Building and Interpreting a Confusion Matrix


When building a classifier, we want to minimize the number of misclassified observations, which in
binary cases can be termed false positives and false negatives. In a false positive, the model incorrectly predicts that an instance belongs to the positive class. For the cancer detection example,
a false positive would be classifying an individual as having cancer, when in reality, the person does
not have it. On the other hand, a false negative occurs when the model incorrectly produces a
negative class. In the cancer diagnostic case, this would mean saying a person doesn't have cancer,
when in fact they do.
A confusion matrix helps organize and visualize this information. Each row represents the actual
number of observations in a class, and each column represents the number of observations
predicted as belonging to a class.
                                    Predicted
                        Positive                  Negative
Actual    Positive      True Positive (TP)        False Negative (FN)      Sensitivity
Class                                             (Type 2 Error)
          Negative      False Positive (FP)       True Negative (TN)       Specificity
                        (Type 1 Error)
                        Precision                 Negative Predictive      Accuracy
                                                  Value

Precision and Recall


Two metrics that go beyond accuracy are precision and recall. In classification, precision is the actual
positive proportion of observations that were predicted positive by the classifier. In the cancer
diagnostic example, it's the percentage of people you said would have cancer who actually ended up
having the disease. Recall, also known as sensitivity, is the percentage of total positive cases
captured, out of all positive cases. It's essentially how well you do in finding people with cancer.
In real-world modeling, there's a natural trade-off between optimizing for precision or recall. For
example, having high recall (catching most people who have cancer) ends up saving the lives of
some people with the disease. However, this often leads to misdiagnosing others who didn't truly
have cancer, which subjects healthy people to costly and dangerous treatments like chemotherapy
for a cancer they never had. On the flip side, having high precision means being confident that when
the diagnostic comes back positive, the person really has cancer. However, this often means missing
some people who truly have the disease. These patients with missed diagnoses may gain a false
sense of security, and their cancer, left unchecked, could lead to fatal outcomes.
During interviews, be prepared to talk about the precision versus recall trade-off. For open-ended
case questions and take-home challenges, be sure to contextualize the business and product impact
of a false positive or a false negative. In cases where both precision and recall are equally important, you can optimize the F1 score: the harmonic mean of precision and recall.
F1 = (2 × precision × recall) / (precision + recall)
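These metrics are all one-liners in scikit-learn; a minimal sketch on an imbalanced toy dataset, in the spirit of the rare-cancer example, might look like this (the data and model are stand-ins).

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split

# Imbalanced toy data: roughly 5% positive class
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(confusion_matrix(y_test, y_pred))          # rows = actual class, columns = predicted class
print("precision:", precision_score(y_test, y_pred))
print("recall:", recall_score(y_test, y_pred))
print("F1:", f1_score(y_test, y_pred))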
Visualizing Classifier Performance
Besides precision, recall, and the F1 score, another popular way to evaluate classifiers is the receiver
operating characteristic (ROC) curve. The ROC curve plots the true positive rate versus the false
positive rate for various thresholds. The area under the curve (AUC) measures how well the classifier
separates classes. The AUC of the ROC curve is between zero and one, and a higher number means the model performs better in separating the classes. The most optimal is a curve that "hugs" the top
left of the plot, as shown below. This indicates that a model has a high true-positive rate and
relatively low false-positive rate.

[Figure: ROC curve plotting the true positive rate against the false positive rate]

Logistic Regression
One of the most popular classification algorithms is logistic regression, and it is asked about almost
as frequently as linear regression during interviews. In logistic regression, a linear output is
converted into a probability between 0 and 1 using the sigmoid function:
S(x) = 1 / (1 + e^(−xβ))
In the equation above, X is the set of predictor features and β is the corresponding vector of weights. Computing S(x) above produces a probability that indicates if an observation should be classified as a "1" (if the calculated probability is at least 0.5), and a "0" otherwise.
P(Ŷ = 1 | X) = S(Xβ)
[Figure: a linear regression fit vs. a logistic regression (sigmoid) fit plotted against an independent variable]

The loss function for logistic regression, also known as log-loss, is formulated as follows:

L(β) = Σ_{i=1}^{n} [ y_i log(1 / S(x_i β)) + (1 − y_i) log(1 / (1 − S(x_i β))) ]
Note that in cases where more than two outcome classes exist, softmax regression is a commonly
used technique that generalizes logistic regression.
In practice, logistic regression, much like its cousin linear regression, is often used because it is highly
interpretable: its output, a predicted probability, is easy to explain to decision makers. Additionally,
its quickness to compute and ease of use often make it the first model employed for classification
problems in a business context.
Note, however, that logistic regression does not work well under certain circumstances. Its relative
simplicity makes it a high-bias and low-variance model, so it may not perform well when the decision
boundary is not linear. Additionally, when features are highly correlated, the coefficients won't be as
accurate. To address these cases, you can use techniques similar to those used in linear regression
(regularization, removal of features, etc.) for dealing with this issue. For interviews, it is critical to
understand both the mechanics and pitfalls of logistic regression.
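A brief sketch tying the sigmoid to scikit-learn's implementation; the simulated data and the default 0.5 threshold are illustrative assumptions.

import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    """Map a linear score to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = (sigmoid(1.5 * X[:, 0] - 2.0 * X[:, 1]) > rng.random(1000)).astype(int)

clf = LogisticRegression(penalty="l2", C=1.0).fit(X, y)   # C is the inverse regularization strength
probs = clf.predict_proba(X[:5])[:, 1]    # interpretable output: predicted probabilities
labels = (probs >= 0.5).astype(int)       # the usual 0.5 decision threshold
print(probs, labels)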

Naive Bayes
Naive Bayes classifiers require only a small amount of training data to estimate the necessary
parameters. They can be extremely fast compared to more sophisticated methods (such as support
vector machines). These advantages lead to Naive Bayes being a popularly used first technique in
modeling, and is why this type of classifier shows up in interviews.
Naive Bayes uses Bayes' rule (covered in Chapter 6: Statistics) and a set of conditional independence
assumptions in order to learn P(Y|X). There are two assumptions to know about Naive Bayes:
1. It assumes each feature X_i is independent of any other feature X_j given Y, for any pair of features X_i and X_j.
2. It assumes each feature is given the same weight.
The decoupling of the class conditional feature distributions means that each distribution can be
independently estimated as a one-dimensional distribution. That is, we have the following:
P(X_1, …, X_n | Y) = ∏_{i=1}^{n} P(X_i | Y)

Using the conditional independence assumption, and then applying Bayes' theorem, the classification rule becomes:
ŷ = argmax_k P(Y = y_k) ∏_j P(X_j | Y = y_k)

To understand the beauty of Naive Bayes, recall that for any ML model having k features, there are 2^k possible feature interactions (the correlations between them all). Due to the large number of feature interactions, typically you'd need on the order of 2^k data points for a high-performing model. However, due to the conditional independence assumption in Naive Bayes, there only need to be on the order of k data points, which
removes this problem.
For text classification (e.g., classifying spam, sentiment analysis), this assumption is convenient since
there are many predictors (words) that are generally independent of one another.
While the assumptions simplify calculations and make Naive Bayes highly scalable to run, they are
often not valid. In fact, the first conditional independence assumption generally never holds true,
since features do tend to be correlated. Nevertheless, this technique performs well in practice since
most data is linearly separable.
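A minimal text-classification sketch using bag-of-words features and Multinomial Naive Bayes; the tiny "spam" dataset is invented purely for illustration.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical spam dataset, purely for illustration
texts = ["win a free prize now", "limited offer click here", "free cash win win",
         "lunch at noon tomorrow", "see you at the meeting", "notes from class today"]
labels = [1, 1, 1, 0, 0, 0]   # 1 = spam, 0 = not spam

# Bag-of-words features + Multinomial Naive Bayes, a classic text-classification baseline
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["free prize tomorrow"]))        # likely classified as spam
print(model.predict_proba(["meeting notes today"]))  # class probabilities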

SVMs
The goal of SVM is to form a hyperplane that linearly separates the training data. Specifically, it aims
to maximize the margin, which is the minimum distance from the decision boundary to any training
point. The points closest to the hyperplane are called the support vectors. Note that the decision
boundaries for SVMs can be nonlinear, which is unlike that of logistic regression, for example.

In the image above, it's easy to visualize how a line can be found that separates the points correctly
into their two classes. In practice, splitting the points isn't that straightforward. Thus, SVMs rely on a
kernel to transform data into a higher-dimensional space, where it then finds the hyperplane that
best separates the points. The image below visualizes this kernel transformation:

[Figure: kernel transformation mapping points from the input space into a higher-dimensional feature space]

Mathematically, the kernel generalizes the dot product to a higher dimension:


k(x, y) = φ(x)ᵀφ(y), where φ denotes the mapping into the higher-dimensional feature space.
The RBF (radial basis function) kernel, also known as the Gaussian kernel, is the most popular kernel used in practice. The general rule of thumb is this: for linear problems, use a linear kernel, and for nonlinear
problems, use a nonlinear kernel like RBF. SVMs can be viewed as a kernelized form of ridge
regression because they modify the loss function employed in ridge regression.
SVMs work well in high-dimensional spaces (a larger number of dimensions versus the number of
data points) or when a clear hyperplane divides the points. Conversely, SVMs don't work well on
enormous data sets, since computational complexity is high, or when the target classes overlap and
there is no clean separation. Compared to simpler methods with linear decision boundaries such as
logistic regression and Naive Bayes, you may want to use SVMs if you have nonlinear decision
boundaries and/or a much smaller amount of data. However, if interpretability is important, SVMs
are not preferred because they do not have simple-to-understand outputs (like logistic regression
does with a probability).
For ML-heavy roles, know when to use which kernel, the kernel trick, and the underlying
optimization problem that SVMs solve.
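As a hedged example of the linear-vs-nonlinear kernel choice, the two-moons toy dataset is not linearly separable, so the RBF kernel typically does much better; the hyperparameters below are scikit-learn defaults, not tuned values.

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two interleaving half-moons: not separable by a straight line in the input space
X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear", C=1.0).fit(X_train, y_train)
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)   # kernel trick

print("linear kernel accuracy:", linear_svm.score(X_test, y_test))
print("RBF kernel accuracy:", rbf_svm.score(X_test, y_test))   # typically much higher here
print("support vectors per class:", rbf_svm.n_support_)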

Decision Trees
Decision trees and random forests are commonly discussed during interviews since they are flexible
and often perform well in practice for both classification and regression use cases. Since both use
cases are possible, decision trees are also known as CART (classification and regression trees). For
this section, we'll focus on the classification use case for decision trees. While reading this section,
keep in mind that for interviews, it helps to understand how both decision trees and random forests
are trained. Related topics of entropy and information gain are also crucial to review before a data
science interview.

Training
A decision tree is a model that can be represented in a treelike form determined by binary splits
made in the feature space and resulting in various leaf nodes, each with a different prediction. Trees
are trained in a greedy and recursive fashion, starting at a root node and subsequently proceeding
through a series of binary splits in features (i.e., variables) that lead to minimal error in the
classification of observations.
[Figure: decision tree for survival of passengers on the Titanic, splitting on gender, age, and the number of siblings/spouses aboard (sibsp)]

Entropy
The entropy of a random variable Y quantifies the uncertainty in Y. For a discrete variable Y (assuming k states), it is stated as follows:
H(Y) = −Σ_{i=1}^{k} P(Y = i) log P(Y = i)

For example, for a simple Bernoulli random variable, this quantity is highest when p = 0.5 and lowest
when p = 0 or p = 1, a behavior that aligns intuitively with its definition since if p = 0 or 1, then there
is no uncertainty with respect to the result. Generally, if a random variable has high entropy, its
distribution is closer to uniform than a skewed one. There are many related measures of impurity — in
practice, the Gini index is commonly used for decision trees.
In the context of decision trees, consider an arbitrary split. We have H(Y) from the initial training
labels and assume that we have some feature X on which we want to split. We can characterize the
reduction in uncertainty given by the feature X, known as information gain, which can be formulated
as follows:
IG(Y, X) = H(Y) − H(Y | X)
The larger IG(Y, X) is, the higher the reduction in uncertainty in Y by splitting on X. Therefore, the
general process assesses all features in consideration and chooses the feature that maximizes this
information gain, then recursively repeats the process on the two resulting branches.
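A small NumPy sketch of entropy and information gain for a discrete feature, matching the formulas above; the toy labels and splits are illustrative.

import numpy as np

def entropy(labels):
    """Shannon entropy of a label array: H(Y) = -sum p_k log2 p_k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, feature):
    """IG(Y, X) = H(Y) - H(Y | X) for a discrete feature X."""
    h_y = entropy(labels)
    h_y_given_x = 0.0
    for value in np.unique(feature):
        mask = feature == value
        h_y_given_x += mask.mean() * entropy(labels[mask])
    return h_y - h_y_given_x

# Toy example: a feature that perfectly splits the labels has maximal information gain
y = np.array([0, 0, 1, 1])
perfect_split = np.array(["a", "a", "b", "b"])
useless_split = np.array(["a", "b", "a", "b"])
print(information_gain(y, perfect_split))   # 1.0 bit
print(information_gain(y, useless_split))   # 0.0 bits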

Random Forests
Typically, an individual decision tree may be prone to overfitting because a leaf node can be created for each observation. In practice, random forests yield better out-of-sample predictions than decision trees. A random forest is an ensemble method that can utilize many decision trees, whose decisions it averages.
Two characteristics of random forests allow a reduction in overfitting and the correlation between
the trees. The first is bagging, where individual decision trees are fitted following each bootstrap
sample and then averaged afterwards. Bagging significantly reduces the variance of the random
forest versus the variance of any individual decision trees. The second way random forests reduce
overfitting is that a random subset of features is considered at each split, preventing the important
features from always being present at the tops of individual trees.
Random forests are often used due to their versatility, interpretability (you can quickly see feature
importance), quick training times (they can be trained in parallel), and prediction performance. In
interviews, you'll be asked about how they work versus a decision tree, and when you would use a
random forest over other techniques.

[Figure: a random forest routing a test sample input through many decision trees and averaging their predictions]

Boosting
Boosting is a type of ensemble model that trains a sequence of "weak" models (such as small
decision trees), where each one sequentially compensates for the weaknesses of the preceding
models. Such weaknesses can be measured by the current model's error rate, and the relative error
rates can be used to weigh which observations the next models should focus on. Each training point
within a dataset is assigned a particular weight and is continually re-weighted in an iterative fashion
such that points that are mispredicted take on higher weights in each iteration. In this way, more
emphasis is placed on points that are harder to predict. This can lead to overfitting if the data is
especially noisy.
One example is AdaBoost (adaptive boosting), which is a popular technique used to train a model
based on tuning a variety of weak learners. That is, it sequentially combines decision trees with a single split, and then weights are uniformly set for all data points. At each iteration, data points are re-weighted according to whether each was classified correctly or incorrectly by a classifier. At the end, the weighted predictions of each classifier are combined to obtain a final prediction.


The generalized form of AdaBoost is called gradient boosting. A well-known form of gradient
boosting used in practice is called XGBoost (extreme gradient boosting). Gradient boosting is similar
to AdaBoost, except that shortcomings of previous models are identified by the gradient rather than
high weight points, and all classifiers have equal weights instead of having different weights. In
industry, XGBoost is used heavily due to its execution speed and model performance.
Since random forests and boosting are both ensemble methods, interviewers tend to ask questions
comparing and contrasting the two. For example, one of the most common interview questions is
"What is the difference between XGBoost and a random forest?"

Dimensionality Reduction
Imagine you have a dataset with one million rows but two million features, most of which are null
across the data points. You can intuitively guess that it would be hard to tease out which features are
predictive for the task at hand. In geometric terms, this situation demonstrates sparse data spread
over multiple dimensions, meaning that each data point is relatively far away from other data points.
This lack of closeness is problematic, because when extracting patterns using machine learning, the
idea of similarity or closeness of data often matters a great deal. If a particular data point has
nothing close to it, how can an algorithm make sense of it?
This phenomenon is known as the curse of dimensionality. One way to address this problem is to
increase the dataset size, but often, in practice, it's costly or infeasible to get more training data.
Another way is to conduct feature selection, such as removing multicollinearity, but this can be
challenging with a very large number of features.
Instead, we can use dimensionality reduction, which reduces the complexity of the problem with
minimal loss of important information. Dimensionality reduction enables you to extract useful information from such data, where doing so directly can be difficult or even too expensive, since any algorithm we would use would have to incorporate so many features. Decomposing the data into a smaller
set of variables is also useful for summarizing and visualizing datasets. For example, dimensionality
reduction methods can be used to project a large dataset into 2D or 3D space for easier visualization.

Principal Components Analysis


The most commonly used method to reduce the dimensionality of a dataset is principal components
analysis (PCA). PCA combines highly correlated variables into a new, smaller set of constructs called
principal components, which capture most of the variance present in the data. The algorithm looks
for a small number of linear combinations for each row vector to explain the variance within X. For
example, in the image below, the variation in the data is largely summarized by two principal
components.
More specifically, PCA finds the vector w of weights such that we can define the following linear
combination:
y_i = w_i^T x = Σ_{j=1}^{p} w_ij x_j
subject to the following: y_i is uncorrelated with y_j, and var(y_i) is maximized


Hence, the algorithm proceeds by first finding the component having maximal variance. Then, the
second component found is uncorrelated with the first and has the second-highest variance, and so
on for the other components. The algorithm ends with some number of dimensions k such that y_1, …, y_k explain the majority of the variance, with k << p.
The final result is an eigendecomposition of the covariance matrix of X, where the first principal
component is the eigenvector corresponding to the largest eigenvalue and the second principal
component corresponds to the eigenvector with the second largest eigenvalue, and so on. Generally,
the number of components you choose is based on your threshold for the percent of variance your
principal components can explain. Note that while PCA is a linear dimensionality reduction method,
t-distributed stochastic neighbor embedding (t-SNE) is a non-linear, non-deterministic method used
for data visualization.
In interviews, PCA questions often test your knowledge of the assumptions (like that the variables
need to have a linear relationship). Commonly asked about as well are pitfalls of PCA, like how it
struggles with outliers, or how it is sensitive to the units of measurement for the input features (data
should be standardized). For more ML-heavy roles, you may be asked to whiteboard the
eigendecomposition.
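A short scikit-learn sketch showing the standardization step and how many components are needed to explain a chosen share of the variance; the 90% threshold and synthetic data are arbitrary assumptions.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
X[:, 1] = 2 * X[:, 0] + rng.normal(scale=0.1, size=500)   # a deliberately correlated pair

# PCA is sensitive to units, so standardize first
X_scaled = StandardScaler().fit_transform(X)

# Keep just enough components to explain 90% of the variance
pca = PCA(n_components=0.9, svd_solver="full")
X_reduced = pca.fit_transform(X_scaled)

print("components kept:", pca.n_components_)
print("explained variance ratio:", pca.explained_variance_ratio_)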

Clustering
Clustering is a popular interview topic since it is the most commonly employed unsupervised machine learning technique. Recall that unsupervised learning means that there is no labeled
training data, i.e., the algorithm is trying to infer structural patterns within the data, without a
prediction task in mind. Clustering is often done to find "hidden" groupings in data, like segmenting
customers into different groups, where the customers in a group have similar characteristics.
Clustering can also be used for data visualization and outlier identification, as in fraud detection, for
instance. The goal of clustering is to partition a dataset into various clusters or groups by looking only
at the data's input features.
Ideally, the clustered groups have two properties:
 Points within a given cluster are similar (i.e., high intra-cluster similarity).

 Points in different clusters are not similar (i.e., low inter-cluster similarity).


K-Means clustering
A well-known clustering algorithm, k-means clustering is often used because it is easy to interpret
and implement. It proceeds, first, by partitioning a set of data into k distinct clusters and then
arbitrarily selects centroids of each of these clusters. It iteratively updates partitions by first
assigning points to the closest cluster, then updating centroids, and then repeating this process until
convergence. This process essentially minimizes the total intra-cluster variation across all clusters.

[Figure: k-means clustering assigning points to k clusters around their centroids]

Mathematically, k-means clustering reaches a solution by minimizing a loss function (also known as
distortion function). In this example, we minimize squared Euclidean distance (given points x_i and centroid values μ_j):
L = Σ_{j=1}^{k} Σ_{x_i ∈ S_j} ‖x_i − μ_j‖²
where S_j represents the particular cluster.


The iterative process continues until the cluster assignment updates fail to improve the objective
function. Note that k-means clustering uses Euclidean distance when assessing how close points are
to one another and that k, the number of clusters to be estimated, is set by the user and can be
optimized if necessary.
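A minimal scikit-learn sketch; since k must be set by the user, one common (though not the only) approach is to try several values and compare the inertia (the distortion above) and silhouette scores. The blob-shaped synthetic data is a stand-in for real customer features.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic customer-like data with 4 latent groups
X, _ = make_blobs(n_samples=1000, centers=4, cluster_std=1.0, random_state=0)

# Try several values of k and inspect inertia (total within-cluster distortion) and silhouette score
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1), round(silhouette_score(X, km.labels_), 3))

# km.cluster_centers_ holds the final centroids; km.labels_ holds the cluster assignments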

K-means Alternatives
One alternative to k-means is hierarchical clustering. Hierarchical clustering assigns each data point to
its own cluster and merges the clusters that are nearest (based on any of a variety of distance metrics)
until there is only one cluster left, generally visualized using a dendrogram. In cases where there is
not a specific number of clusters, or you want a more interpretable and informative output,
hierarchical clustering is more useful than k-means.
While quite similar to k-means, density clustering is another distinct technique. The most well-
known implementation of this technique is DBSCAN. Density clustering does not require the number of
clusters as a parameter. Instead, it infers that number, and learns to identify clusters of arbitrary
shapes. Generally, density clustering is more helpful for outlier detection than k-means.
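For reference, a short sketch of both alternatives using SciPy and scikit-learn; the data, linkage method, and DBSCAN parameters are illustrative assumptions.

```python
# Hierarchical clustering and DBSCAN sketch (data and parameters are placeholders).
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.cluster import DBSCAN

X = np.random.rand(100, 2)             # hypothetical feature matrix

# Hierarchical (agglomerative) clustering: build the full merge tree, typically shown as a dendrogram.
Z = linkage(X, method="ward")          # successive nearest-cluster merges using Ward's distance
# dendrogram(Z)                        # uncomment inside a matplotlib figure to visualize the tree

# DBSCAN: no k required; low-density points are labeled -1, which is handy for outlier detection.
db = DBSCAN(eps=0.1, min_samples=5).fit(X)
print(set(db.labels_))                 # inferred cluster labels, with -1 marking noise/outliers
```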

Gaussian Mixture Model (GMM)


A GMM assumes that the data being analyzed come from a "mixture" of k Gaussian/normal
distributions, each having a different mean and variance, where the mixture components are
basically the proportion of observations in each group. Compared to k-means, which is a
deterministic algorithm where k is set in advance, GMMs essentially try to learn the true value of k.


For example, TikTok may be on the lookout for anomalous profiles, and can use GMMs to cluster
various accounts based on features (number of likes sent, messages sent, and comments made) and
identify any accounts whose activity metrics don't seem to fall within the typical user activity
distributions.

Compared to k-means, GMMs are more flexible because k-means only takes into account the mean
of a cluster, while GMMs take into account the mean and variance. Therefore, GMMs are particularly
useful in cases with low-dimensional data or where cluster shapes may be arbitrary. While practically
never asked about in data science interviews (compared to k-means), we brought up GMMs for
those seeking more technical ML research and ML engineering positions.
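In the spirit of the TikTok example above, here is a hedged sketch of GMM-based anomaly scoring with scikit-learn; the account features, the number of components, and the 1% threshold are illustrative assumptions.

```python
# GMM anomaly-scoring sketch (features, component count, and threshold are placeholders).
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.rand(1000, 3)                    # e.g., likes sent, messages sent, comments made

gmm = GaussianMixture(n_components=4, random_state=0).fit(X)

log_likelihood = gmm.score_samples(X)          # per-account log-likelihood under the fitted mixture
threshold = np.percentile(log_likelihood, 1)   # flag the least likely 1% of accounts
anomalies = np.where(log_likelihood < threshold)[0]
print(len(anomalies), "accounts flagged for review")
```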

Neural Networks
While the concepts behind neural networks have been around since the 1950s, it's only in the last
15 years that they've grown in popularity, thanks to an explosion of data being created, along with
the rise of cheap cloud computing resources needed to store and process the massive amounts of
newly created data. As mentioned earlier in the chapter, if your resume has any machine learning
projects involving deep learning experience, then the technical details behind neural networks will
be considered fair game by most interviewers. But for a product data science position or a finance
role (where data can be very noisy, so most models are not purely neural networks), don't expect to
be bombarded with tough neural network questions. Knowing the basics of classical ML techniques
should suffice.
When neural nets are brought up during interviews, questions can range anywhere from qualitative
assessments on how deep learning compares to more traditional machine learning models to
mathematical details on gradient descent and backpropagation. On the qualitative side, it helps to
understand all of the components that go into training neural networks, as well as how neural
networks compare to simpler methods.

Perceptron
Neural networks function in a way similar to biological neurons. They take in various inputs (at input
layers), weight these inputs, and then combine the weighted inputs through a linear combination
(much like linear regression). If the combined weighted output is past some threshold set by an
activation function, the output is then sent out to other layers. This base unit is generally referred to

Ace the Science Interview


ACE THE DATA SCIENCE INTERVIEW I & SINGH

as a perceptron. Perceptrons are combined to form neural networks, which is why they are also
known as multi-layer perceptrons (MLPs).

[Figure: single-layer perceptron]

While the inputs for a neural network are combined via a linear combination, often, the activation
function is nonlinear. Thus, the relationship between the target variable and the predictor features
(variables) frequently ends up also being nonlinear. Therefore, neural networks are most useful
when representing and learning nonlinear functions.
For reference, we include a list of common activation functions below. The scope of when to use
which activation function is outside of this text, but any person interviewing for an ML-intensive role
should know these use cases along with the activation function's formula.

Activation Function      Equation                                         Example
Unit step (Heaviside)    φ(z) = 0 if z < 0; 0.5 if z = 0; 1 if z > 0      Perceptron variant
Sign (signum)            φ(z) = −1 if z < 0; 0 if z = 0; 1 if z > 0       Perceptron variant
Linear                   φ(z) = z                                         Adaline, linear regression
Piece-wise linear        φ(z) = z within a band, clipped to 0 and 1       Support vector machine
Logistic (sigmoid)       φ(z) = 1 / (1 + e^(−z))                          Logistic regression, multi-layer NN
Hyperbolic tangent       φ(z) = tanh(z)                                   Multi-layer neural network
Rectifier (ReLU)         φ(z) = max(0, z)                                 Multi-layer neural network
Rectifier (softplus)     φ(z) = ln(1 + e^z)                               Multi-layer neural network
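For quick reference, the most common of these activation functions can be written in a few lines of NumPy; this is just a sketch and is not tied to any particular deep learning framework.

```python
# Reference implementations of a few activation functions from the table above.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # logistic: squashes z into (0, 1)

def tanh(z):
    return np.tanh(z)                 # hyperbolic tangent: squashes z into (-1, 1)

def relu(z):
    return np.maximum(0.0, z)         # rectifier: zero for negative inputs, identity otherwise

def softplus(z):
    return np.log1p(np.exp(z))        # smooth approximation of ReLU

z = np.linspace(-3, 3, 7)
print(sigmoid(z), tanh(z), relu(z), softplus(z), sep="\n")
```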

In neural networks, the process of receiving inputs and generating an output continues until an
output layer is reached. This is generally done in a forward manner, meaning that layers process
incoming data in a sequential forward way (which is why most neural networks are known as "feed-
forward"). The layers of neurons that are not the input or output layers are called the hidden layers
(hence the name "deep learning" for neural networks having many of these). Hidden layers allow for
specific transformations of the data within each layer. Each hidden layer can be specialized to
produce a particular output — for example, in a neural network used for navigating roads, one
hidden layer may identify stop signs, and another hidden layer may identify traffic lights. While those
hidden layers are not enough to independently navigate roads, they can function together within a
larger neural network to drive better than Nick at age 16.

[Figure: feed-forward neural network with input data flowing through hidden layers 1 through N to the output]

Backpropagation
The learning process for neural networks is called backpropagation. This technique modifies the
weights of the neural network iteratively through calculation of deltas between predicted and
expected outputs. After this calculation, the weights are updated backward through earlier layers via
stochastic gradient descent. This process continues until the weights that minimize the loss function
are found.
For regression tasks, the commonly used loss function to be optimized is squared error, whereas for
classification tasks the common loss function used is cross-entropy. Given a loss function L, we can
update the weights through the chain rule, of the following form, where z is the model's output (and
the best guess of our target variable y):

\frac{\partial L(z, y)}{\partial w} = \frac{\partial L(z, y)}{\partial z} \cdot \frac{\partial z}{\partial x} \cdot \frac{\partial x}{\partial w}
and the weights are updated via:
w = w - \alpha \, \frac{\partial L(z, y)}{\partial w}
For ML-heavy roles, we've seen interviewers expect candidates to explain the technical details
of backpropagation on a whiteboard for basic methods such as linear regression or
logistic regression.
Interviewers also like to ask about the hyperparameters involved in neural networks. For example,
the amount that the weights are updated during each training step, α, is called the learning rate. If
the learning rate is too small, the optimization process may stall. Conversely, if the learning rate is
too large, the optimization might converge prematurely at a suboptimal solution. Besides the
learning rate, other hyperparameters in neural networks include the number of hidden layers, the
activation functions used, batch size, and so on. For an interview, it's helpful to know how each
hyperparameter affects a neural network's training time and model performance.
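To make the weight-update rule concrete, here is a minimal sketch of gradient descent for logistic regression, the kind of "basic backpropagation" derivation mentioned above; the data, learning rate, and iteration count are illustrative assumptions.

```python
# Gradient descent for logistic regression (data and hyperparameters are placeholders).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                            # hypothetical features
y = (X @ np.array([1.5, -2.0, 0.5]) > 0).astype(float)   # hypothetical binary labels

w = np.zeros(3)                                          # model weights
alpha = 0.1                                              # learning rate (the α above)

for _ in range(500):
    p = sigmoid(X @ w)                                   # model output
    grad = X.T @ (p - y) / len(y)                        # ∂L/∂w for the cross-entropy loss, via the chain rule
    w -= alpha * grad                                    # w = w − α ∂L/∂w

print(w)                                                 # learned weights
```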

Training Neural Networks


Because neural networks require a gargantuan number of weights to train, along with a considerable
number of hyperparameters to search over, the training process for a neural network can run
into many problems.

General Framework
One issue that can come up in training neural nets is the problem of vanishing gradients. Vanishing
gradients refers to the fact that sometimes the gradient of the loss function will be tiny, and may
completely stop the neural network from training because the weights aren't updated properly.
Since backpropagation uses the chain rule, multiplying n small numbers to compute gradients for
early layers in a network means that the gradient gets exponentially smaller with more layers. This
can happen particularly with traditional activation functions like hyperbolic tangent, whose gradients
range between zero and one. The opposite problem, where activation functions create large
derivatives, is known as the exploding gradient problem.
One common technique to address extremes in gradient values is to allow gradients from later layers
to directly pass into earlier layers without being multiplied many times — something which residual
neural networks (ResNets) and LSTMs both utilize. Another approach to prevent extremes in the
gradient values is to alter the magnitude of the gradient changes by changing the activation function
used (for example, ReLU). The details behind these methods are beyond this book's scope but are
worth looking into for ML-heavy interviews.

Training Optimization Techniques


Additionally, there are quite a few challenges in using vanilla gradient descent to train deep learning
models. A few examples include getting trapped in suboptimal local minima or saddle points, not
using a good learning rate, or dealing with sparse data with features of different frequencies where
we may not want all features to update to the same extent. To address these concerns, there are a
variety of optimization algorithms used.


Momentum is one such optimization method used to accelerate learning while using SGD. While
using SGD, we can sometimes see small and noisy gradients. To solve this, we can introduce a new
parameter, velocity, which is the direction and speed at which the learning dynamics change. The
velocity changes based on previous gradients (in an exponentially decaying manner) and increases
the step size for learning in any iteration, which helps the gradient maintain a consistent direction
and pace throughout the training process.
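A bare-bones sketch of the momentum update described above follows; the toy gradient function and the hyperparameter values are illustrative assumptions.

```python
# SGD-with-momentum update sketch (gradient function and hyperparameters are placeholders).
import numpy as np

def grad_fn(w):
    return w                        # gradient of a toy quadratic loss ||w||^2 / 2

w = np.array([5.0, -3.0])           # initial weights
velocity = np.zeros_like(w)         # accumulates past gradients in an exponentially decaying way
alpha = 0.1                         # learning rate
beta = 0.9                          # momentum coefficient

for _ in range(100):
    g = grad_fn(w)
    velocity = beta * velocity + g  # blend the previous direction with the current gradient
    w = w - alpha * velocity        # step along the smoothed direction

print(w)                            # approaches the minimum at [0, 0]
```
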
Transfer Learning
Lastly, for training neural networks, practitioners often use or repurpose pre-trained layers (components
of a model that have already been trained and published). This approach is called transfer learning
and is especially common in cases where models require a large amount of data (for example, BERT
for language models and ImageNet for image classification). Transfer learning is beneficial when you
have insufficient data for a new domain, and there is a large pool of existing data that can be
transferred to the problem of interest. For example, say you wanted to help Jian Yang from the TV
show Silicon Valley build an app to detect whether something was a hot dog or not a hot dog. Rather
than just using your 100 images of hot dogs, you can use a model pre-trained on ImageNet (which
contains many millions of images) to get a great model right off the bat, and then layer on any extra
specific training data you might have to further improve accuracy.

Addressing Overfitting
Deep neural networks are prone to overfitting because of the model complexity (there are many
parameters involved). As such, interviewers frequently ask about the variety of techniques which are
used to reduce the likelihood of a neural network overfitting. Adding more training data is the
simplest way to address variance if you have access to significantly more data and computational
power to process that data. Another way is to standardize features (so each feature has 0 mean and
unit variance), since this speeds up the learning algorithm. Without normalized inputs, each feature
takes on a wide range of values, and the corresponding weights for those features can vary
dramatically, resulting in larger updates in backpropagation. These large updates may cause
oscillation in the weights during the learning stage, which causes overfitting and high variance.
Batch normalization is another technique to address overfitting. In this process, activation values are
normalized within a given batch so that the representations at the hidden layers do not vary
drastically, thereby allowing each layer of a network to learn more independently of one another.
This is done for each hidden neuron, and also improves training speed. Here, applying a
standardization process similar to how inputs are standardized is recommended.
Lastly, dropout is a regularization technique that deactivates several neurons randomly at each
training step to avoid overfitting. Dropout enables simulation of different architectures, because
instead of a full original neural network, there will be random nodes dropped at each layer. Both
batch normalization and dropout help with regularization since the effects they have are similar to
adding noise to various parts of the training process.
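As a rough illustration of dropout, here is "inverted" dropout applied to a layer's activations in NumPy; the keep probability and activation values are illustrative assumptions, and real frameworks provide this as a built-in layer.

```python
# Inverted dropout sketch (keep probability and activations are placeholders).
import numpy as np

rng = np.random.default_rng(0)
activations = rng.normal(size=(4, 8))     # hypothetical hidden-layer activations for a mini-batch

keep_prob = 0.8                           # keep each neuron with probability 0.8 during training
mask = rng.random(activations.shape) < keep_prob

dropped = activations * mask / keep_prob  # zero out ~20% of neurons and rescale the rest so the
                                          # expected activation magnitude matches test time
print(dropped)
```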

Types of Neural Networks


While a deep dive into the various neural network architectures is beyond the scope of this book,
below, we jog your memory with a few of the most popular ones along with their applications. For
ML-heavy roles, interviewers tend to ask about the different layer types used within each
architecture, and to compare and contrast each architecture against one another.

CNNs


Convolutional neural networks (CNNs) are heavily used in computer vision because they can capture
the spatial dependencies of an image through a series of filters. Imagine you were looking at a
picture of some traffic lights. Intuitively, you need to figure out the components of the lights (i.e., red, green,
yellow) when processing the image. CNNs can determine elements within that picture by looking at
various neighborhoods of the pixels in the image. Specifically, convolution layers can extract features
such as edges, color, and gradient orientation. Then, pooling layers apply a version of dimensionality
reduction in order to extract the most prominent features that are invariant to rotation and position.
Lastly, the results are mapped into the final output by a fully connected layer.

[Figure: CNN architecture, with a feature learning stage followed by a classification stage]

RNNs
Recurrent neural networks (RNNs) are another common type of neural network. In an RNN, the
nodes form a directed graph along a temporal sequence and use their internal state (called
memory). RNNs are often used in learning sequential data such as audio or video — cases where the
current context depends on past history. For example, say you are looking through the frames of a
video. What will happen in the next frame is likely to be highly related to the current frame, but not
as related to the first frame of the video. Therefore, when dealing with sequential data, having a
notion of memory is crucial for accurate predictions. In contrast to CNNs, RNNs can handle arbitrary
input and output lengths and are not feed-forward neural networks, instead using this internal
memory to process arbitrary sequences of data.

LSTMs
Long Short-Term Memory networks (LSTMs) are a fancier version of RNNs. In LSTMs, a common unit is
composed of a cell, an input gate (how much new information to write to the cell), an output gate
(how much of the cell's contents to expose as output), and a forget gate (how much to erase from the
cell). This architecture allows for regulating the
flow of information into and out of any cell. Compared to vanilla RNNs, which only learn short-term
dependencies, LSTMs have additional properties that allow them to learn long-term dependencies.
Therefore, in most real-world scenarios, LSTMs are used instead of RNNs.

Reinforcement Learning
Reinforcement learning (RL) is an area of machine learning outside of supervised and unsupervised
learning. RL is about teaching an agent to learn which decisions to make in an environment to
maximize some reward function. The agent takes a series of actions throughout a variety of states
and is rewarded accordingly. During the learning process, the agent receives feedback based on the
actions taken and aims to maximize the overall value acquired.


The main components of an RL algorithm include:


 Reward function: defines the goal of the entire RL problem and quantifies what a "good" or
“bad" action is for the agent in any given state.
 Policy: defines how the agent picks its actions by mapping states to actions.
 Model: defines how the agent predicts what to do next as the agent understands the
environment — given a state and action, the model will predict the reward and the next state.
 Value function: predicts the overall expected future reward discounted over time; it is
consistently re-estimated over time to optimize long-term value.
RL is most widely known for use cases in gaming (AlphaGo, chess, Starcraft, etc.) and robotics. It is
best used when the problem at hand is one of actions rather than purely predictions (that is, you do
not know what constitutes "good" actions — supervised learning assumes you already know what
the output should be). Unless you have particular projects or experience with reinforcement
learning, it won't be brought up in data science interviews.

The End-to-End ML Workflow


End-to-end machine learning questions are asked in interviews to see how well you apply machine
learning theory to solve business problems. It isn't just about confronting a real-world problem like
"How would you design Uber's surge pricing algorithm?" and then jumping to a technique like linear
regression or random forests. Instead, it's about asking the right questions about the business goals
and constraints that inform the machine learning system design. It's about walking through how
you'd explore and clean the data and the features you would try to engineer. It's about picking the
right model evaluation metrics and modeling techniques, contextualizing model performance in
terms of business impact, mentioning model deployment strategies, and much, much more.
To help you solve these all-encompassing problems — something most ML-theory textbooks don't
cover — we walk you through the entire machine learning workflow below. However, to really ace
these open-ended ML problems come interview time, also read Chapter 11: Case Studies. In
addition to tips on dealing with open-ended problems, the case study chapter includes ML case
interview questions that force you to apply the concepts detailed below.

Step 1: Clarify the Problem and Constraints


If you frame the business and product problem correctly, you've done half the work. That's because
it's easy to throw random ML techniques at data — but it's harder to understand the business
motivations, technical requirements, and stakeholder concerns that ultimately affect the success of a
deployed machine learning solution. As such, make sure to start your answer by discussing what the
problem at hand is, the business process and context surrounding the problem, any assumptions you
might have, and what prospective stakeholder concerns are likely to be.
Some questions you can use to clarify the problem and the constraints include:
 What is the dependent variable we are trying to model? For example, if we are building a user
churn model, what criteria are we using to define churn in the first place?
 How has the problem been approached in the past by the business? Is there a baseline
performance we can compare against? How much do we need to beat this baseline for the
project to be considered a success?


 Is ML even needed? Maybe a simple heuristic or a rules-based approach works well enough? Or
perhaps a hybrid approach with humans in the loop would work best?
 Is it even legal or ethical to apply ML to this problem? Are there regulatory issues at play
dictating what kinds of data or models you can use? For example, lending institutions cannot
legally use some demographic variables like race.
 How do end users benefit from the solution, and how would they use the solution (as a
standalone, or an integration with existing systems)?
 Is there a clear value add to the business from a successful solution? Are there any other
stakeholders who would be affected?
 If an incorrect prediction is made, how will it impact the business? For example, a spam email
making its way into your inbox isn't as problematic as a high-risk mortgage application
accidentally being approved.
 Does ML need to solve the entire problem, end-to-end, or can smaller decoupled systems be
made to solve sub-problems, whose output is then combined? For example, do you need to
make a full self-driving algorithm, or separate smaller algorithms for environment perception,
path planning, and vehicle control?
Once you've understood the business problem that your ML solution is trying to solve, you can clarify
some of the technical requirements. Aligning on the technical requirements is especially important
when confronted with an ML systems design problem. Some questions to ask to anchor the
conversation:
 What's the latency needed? For example, search autocomplete is useless if it takes predictions
longer to load than it takes users to type out their full query. Does every part of the system need
to be real time—while inference may need to be fast, can training be slow?
 Are there any throughput requirements? How many predictions do you need to serve every
minute?
 Where is this model being deployed? Does the model need to fit on-device? If so, how big is too
big to deploy? And how costly is deployment? For example, adding a high-end GPU to a car is
feasible cost-wise, but adding one to a drone might not be.
While spending so much time on problem definition may seem tedious, the reality is that defining
the right solution for the right problem can save you many weeks of technical work and painful
iterations later down the road. That's why interviewers, when posing open-ended ML problems,
expect you to ask the right questions — ones that scope down your solution. By clarifying these
constraints and objectives up front, you make better decisions on downstream steps of the end-to-
end workflow. To further your business and product clarification skills, read the sections on product
sense and company research in Chapter 10: Product Sense.
And one last piece of advice: don’t go overboard with the questions! Remember, this is a time-bound
interview, so make sure your questions and assumptions are reasonable and relevant (and concise).
You don’t want to be like a toddler and ask 57 questions without getting anywhere.

Step 2: Establish Metrics


Once you've understood the stakeholder objectives and constraints imposed, it's best to pick simple,
observable, and attributable metrics that encapsulate solving the problem. Note that sometimes the
business is only interested in optimizing their existing business KPIs (for example, the time to resolve
a customer request). In that case, you need to be able to align your model performance metrics with
solving the business problem. For example, for a customer support request classification model, a
90% model accuracy means that 50% of the customer tickets that previously needed to be rerouted
now end up in the right place, resulting in a 10% decrease in time to resolution.
In real-world scenarios, it's best to opt for a single metric rather than picking multiple metrics to
capture different sub-goals. That's because a single metric makes it easier to rank model
performance. Plus, it's easier to align the team around optimizing a single number. However, in
interview contexts, it may be beneficial to mention multiple metrics to show you've thought about
the various goals and trade-offs your ML solution needs to satisfy. As such, in an interview, we
recommend you stan your answer with a single metric, but then hedge your answer by mentioning
other potential metrics to track.
For example, if posed a question about evaluation metrics for a spam classifier, you could start off by
talking about accuracy, and then move on to precision and recall as the conversation becomes more
nuanced. In an effort to optimize a single metric, you could recommend using the F-1 score. A
nuanced answer could also incorporate an element of satisficing — where a secondary metric is just
good enough, For example, you could optimize precision @ recall 0.95 — i.e., constraining the recall
to be at least 0.95 while optimizing for precision. Or you could suggest blending multiple metrics into
one by weighting different sub-metrics, such as false positives versus false negatives, to create a final
metric to track. This is often known as an OEC (overall evaluation criterion), and gives you a balance
between different metrics.
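To illustrate the satisficing idea, here is a hedged sketch of computing "precision at recall >= 0.95" and the F1 score with scikit-learn; the labels, scores, and thresholds are illustrative assumptions.

```python
# Precision-at-recall and F1 sketch (labels and scores are placeholders).
import numpy as np
from sklearn.metrics import precision_recall_curve, f1_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)        # hypothetical ground-truth labels
y_scores = rng.random(1000)                   # hypothetical predicted probabilities

precision, recall, thresholds = precision_recall_curve(y_true, y_scores)

# Best precision achievable subject to the satisficing constraint recall >= 0.95.
mask = recall >= 0.95
print("precision @ recall >= 0.95:", precision[mask].max())

# Or collapse precision and recall into a single number with F1 at a 0.5 threshold.
print("F1:", f1_score(y_true, (y_scores >= 0.5).astype(int)))
```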
Once you've picked a metric, you need to establish what success looks like. While for a classifier, you
might desire 100% accuracy, is this a realistic bar for measuring success? Is there a threshold that's
good enough? This is why inquiring about baseline performance in Step 1 becomes crucial. If
possible, you should use the performance of the existing setup for comparison (for example, if the
average time to resolution for customer support tickets is 2 hours, you could aim for 1 hour — not a
97% ticket classification accuracy). Note: in real-world scenarios, the bar for model performance isn't
as high as you'd think to still have a positive business impact.
Be sure to voice all these metric considerations to your interviewer so that you can show you've
thought critically about the problem. For more guidance, read Chapter 10: Product Sense, which
covers the nuances and pitfalls of metric selection.

Step 3: Understand Your Data Sources


Your machine learning model is only as good as the data it sees; hence, the phrase "garbage in,
garbage out." For classroom projects or Kaggle competitions, there isn't much you can do since you
usually have a fixed dataset, and it's all about fitting a model that maximizes some metric. However,
in the real world, you have leeway in what data you use to solve the business problem. As such,
clearly articulate what data sources you would prefer to use to solve the interview problem. While it's
intuitive to use the internal company data relevant to the problem at hand, that's not the be-all end-
all of data sourcing.
For open-ended ML questions, especially at startups that might be testing your scrappiness, you
should think outside the box regarding what data to use. Data sources to consider:
 Can you acquire more data by crowdsourcing it via Amazon Mechanical Turk?
 Can you ask users for data as part of the user onboarding process?
 Can you buy second- and third-party datasets?
 Can you ethically scrape the data from online sources?
 Can you send your unlabeled internal data off to a labeling and annotation service?


To boost model performance, it might not be about collecting more data generally. Instead, you can
intentionally source more examples of edge cases via data augmentation or artificial data synthesis.
For example, suppose your traffic light detector struggles in low-contrast situations. You could make
a vision of your training images that has less contrast in order to give your neural network more
practice on these trickier photos. Taken to the extreme, you can even simulate the entire
environment, as is common in the self-driving car industry. Simulation is used in the autonomous
vehicle space because encountering the volume of rare and risky situations needed to adequately
train a model based on only real-world driving is infeasible.
Finally, do you understand the data? Questions to consider:
 How fresh is the data? How often will the data be updated?
 Is there a data dictionary available? Have you talked to subject matter experts about it?
 How was the data collected? Was there any sampling, selection, or response bias?

Step 4: Explore Your Data


A good first step in exploratory data analysis is to profile the columns at first glance: which ones
might be useful? Which ones have practically no variance and thus wouldn't offer up any real
predictive value? Which columns look noisy? Which ones have a lot of missing or odd values?
Besides skimming through your data, also look at summary statistics like the mean, median, and
quantiles.
“The greatest value of a picture is when it forces
us to notice what we never expected to see.”
—John Tukey
Because a picture is worth a thousand words, visualizing your data is also a crucial step in
exploratory data analysis. For columns of interest, you want to visualize their distributions to
understand their statistical properties like skewness and kurtosis. Certain features (e.g., age, weight)
may be better visualized with a histogram through binning. It also helps to visualize the range of
continuous variables and plot categories of categorical variables. Finally, you can visually inspect the
basic relationships between variables using a correlation matrix. This can help you quickly spot which
variables are correlated with one another, as well as what might be correlated with the target
variable at hand.
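A short pandas sketch of this kind of profiling follows; the DataFrame and its column names are illustrative assumptions.

```python
# Exploratory data analysis sketch (the DataFrame and columns are placeholders).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(18, 80, size=500),
    "weight": rng.normal(75, 12, size=500),
    "churned": rng.integers(0, 2, size=500),
})

print(df.describe())      # mean, median (50%), quantiles, min/max per column
print(df.isna().mean())   # fraction of missing values per column
print(df.nunique())       # columns with ~1 unique value offer little predictive value
print(df.corr())          # correlation matrix, including correlation with the target

df["age"].hist(bins=20)   # histogram of a feature via binning (requires matplotlib)
```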

Step 5: Clean Your Data


Did you get suckered into data science and machine learning because you thought most of your time
would be spent doing sexy data analysis and modeling work? Don't worry, we got fooled too! The
joke (and grievance) that most data scientists spend 80% of their time cleaning data stems from a
frustrating truth: if you feed garbage data into a model, you get garbage out. Between the logging
issues, missing values, data entry errors, data merges gone wrong, changing schemas due to product
changes, and columns whose meaning changes over time, there are many reasons why your data
might be a hot mess. That's precisely why a critical step before modeling is data munging.
One aspect of data munging is dropping irrelevant data or erroneously duplicated rows and columns.
You should also handle incorrect values that don't match up with the supposed data-schema. For
example, for fields containing human-inputted data, there is often a pattern of errors or typos that
you could find and then use to clean up.
To handle missing data, you should first understand the root cause. Based on the results, several
techniques to deal with missing values include:


 Imputing the missing values via basic methods such as column mean/median.

 Using a model or distribution to impute the missing data


 Dropping the rows with missing values (as a last resort)
Another critical data cleaning step is dealing with outliers. Outliers may be due to issues in data
collection, like manual data entry issues or logging hiccups. Or maybe they accurately reflect the
actual data. Outliers can be removed outright, truncated, or left as is, depending on their source and
the business implications of including them or not.
Note that outliers may be univariate, while others are multivariate and require looking over many
dimensions. For an example of a multivariate outlier, consider a dataset of human ages and heights.
A 4-year-old human wouldn't be strange, a 5-foot-tall person isn't odd, but a 4-year-old that's 5 feet
tall would either be an outlier, data entry error, or a Guinness world record holder.
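The following is a hedged sketch of the basic imputation and outlier-truncation steps above, using pandas and scikit-learn; the tiny DataFrame, column names, and percentile cutoffs are illustrative assumptions.

```python
# Missing-value imputation and outlier truncation sketch (data and cutoffs are placeholders).
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "age": [25, 31, np.nan, 47, 120],           # 120 looks like a possible data entry error
    "height_cm": [170, np.nan, 165, 180, 150],
})

# Impute missing values with the column median (model-based or nearest-neighbor imputation
# is a more sophisticated alternative).
imputer = SimpleImputer(strategy="median")
df[["age", "height_cm"]] = imputer.fit_transform(df[["age", "height_cm"]])

# Truncate (cap) outliers at the 1st and 99th percentiles instead of dropping them outright.
lower, upper = df["age"].quantile([0.01, 0.99])
df["age"] = df["age"].clip(lower, upper)
print(df)
```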

Step 6: Feature Engineering


Feature engineering is the art of presenting data to machine learning models in the best way
possible. One part of feature engineering is feature selection: the process of selecting a relevant
subset of features for model construction, based on domain knowledge. The other is feature
preprocessing: transforming your data in a way that allows an algorithm to learn the underlying
structure of the data without having to sift through noisy inputs. Careful feature selection and
feature processing can boost model performance and let you get away with using simpler models.
The specific workflows for feature engineering depend on the type of data involved.
For quantitative data, several common operations are performed:
 Transformations: applying a function (like log, capping, or flooring) can help when the data is
skewed or when you want to make the data conform to more standard statistical distributions (a
requirement for certain models).
 Binning: also known as discretization or bucketing, this process breaks down a continuous
variable into discrete bins, enabling a reduction of noise associated with the data.
 Dimensionality Reduction: to generate a reduced set of uncorrelated features, you can use a
technique like PCA.
Also, standardize and scale data as needed: this is especially important for machine learning
algorithms that may be sensitive to variance in the feature values. For example, K-means uses
Euclidean distance to measure distances between points and clusters, so all features need
comparable variances. To normalize data, you can use min/max scaling, so that all data lies between
zero and one. To standardize features, you can use z-scores, so that data has a mean of zero and a
variance of one.
For categorical data, two common approaches are:
 One-hot encoding: turns each category into a vector of all zeroes, with the exception of a "one"
for the category at hand
 Hashing: turns the data into a fixed dimensional vector using a hashing function; great for when
features have a very high cardinality (large range of values) and the output vector of one-hot
encoding would be too big
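Here is a short sketch of these preprocessing operations with pandas and scikit-learn; the DataFrame and its columns are illustrative assumptions.

```python
# Scaling, standardization, and one-hot encoding sketch (data is a placeholder).
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.DataFrame({
    "income": [40_000, 85_000, 120_000, 52_000],
    "age": [23, 45, 61, 37],
    "device": ["ios", "android", "ios", "web"],
})

# Normalize to [0, 1] with min/max scaling.
df["income_minmax"] = MinMaxScaler().fit_transform(df[["income"]]).ravel()

# Standardize to zero mean and unit variance with z-scores.
df["age_z"] = StandardScaler().fit_transform(df[["age"]]).ravel()

# One-hot encode a categorical column (hashing would be preferable for very high cardinality).
df = pd.get_dummies(df, columns=["device"])
print(df)
```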
While you might not be asked about NLP during most interviews, if it's listed on your resume, it's
good to know a few text preprocessing techniques as well:


 Stemming: reduces a word down to a root word by deleting characters (for example, turning the
words "liked" and "likes" into "like").
 Lemmatization: somewhat similar to stemming, but instead of just reducing words into roots, it
takes into account the context and meaning of the word (for example, it would turn "caring" to
"care," whereas stemming would turn "caring" to "car").
 Filtering: removes "stop words" that don't add value to a sentence like "the" and "a", along with
removing punctuation.
 Bag-of-words: represents text as a collection of words by associating each word and its
frequency.
 N-grams: an extension of bag-of-words where we use N words in a sequence.
 Word embeddings: a representation that converts words to vectors that encode the meaning of
the word, where words that are closer in meaning are closer in vector space (popular methods
include word2vec and GloVe).
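A small sketch of stop-word filtering, bag-of-words, and n-grams with scikit-learn's CountVectorizer follows; the example sentences are illustrative assumptions.

```python
# Bag-of-words and n-gram sketch (the documents are placeholders).
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "The delivery was late and the food was cold",
    "Great food, the delivery arrived early",
]

# Remove English stop words ("the", "and", ...) and count both unigrams and bigrams (N = 2).
vectorizer = CountVectorizer(stop_words="english", ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())   # the learned vocabulary of words and bigrams
print(X.toarray())                          # word/bigram counts per document
```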

Step 7: Model Selection


Given the business constraints, evaluation metrics, and data sources available, what types of models
make the most sense to try? Factors to consider when selecting a model include:
 Training & Prediction Speed: for example, linear regression is much quicker than neural
networks for the same amount of data
 Budget: neural networks, for instance, can be computationally intensive models to train
 Volume & Dimensionality of Data: for example, neural networks can handle large amounts of
data and higher-dimensional data (versus k-NN)
 Categorical vs. Numerical Features: for example, linear regression cannot handle categorical
variables directly, as they need to be one-hot encoded, versus trees (which can generally handle
them directly)
 Explainability: choosing interpretable models like linear regression may be preferable to "black
box" neural networks due to regulatory concerns or their ease of debugging.
Below is a quick cheat sheet to also consider:

[Cheat sheet figure: model selection trade-offs]


Step 8: Model Training & Evaluation


So you picked a type of model, or at least narrowed it down to a few candidates. At this point,
interviewers will expect you to talk about how to train the model. This is a good time to mention the
train-validation-test split, cross-validation, and hyperparameter tuning.
You might also be asked how to compare your model to other models, how to learn its parameters,
and how you'd know if its performance is good enough to ship. They might ask about edge cases, like
handling biased training data, how you'd assess feature importance, or dealing with data
imbalances.
The primary ways to deal with these issues, like picking the right model evaluation metric, using a
validation set, regularizing your models to avoid overfitting, and using learning curves to find
performance plateaus, are explained in-depth earlier in the chapter.
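For reference, a hedged sketch of a held-out test split plus cross-validated hyperparameter tuning with scikit-learn; the data, the logistic regression model, and the grid of C values are illustrative assumptions.

```python
# Train/test split and cross-validated hyperparameter tuning sketch (data and grid are placeholders).
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = rng.integers(0, 2, size=500)

# Hold out a test set that is only touched once, at the very end.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 5-fold cross-validation over a small grid of regularization strengths.
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="f1",
)
grid.fit(X_train, y_train)

print(grid.best_params_)           # chosen hyperparameters
print(grid.score(X_test, y_test))  # final, held-out estimate of performance
```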
On the flip side, if you are dealing with so much data that your models are taking forever to train,
then it is worth looking into various sampling techniques. The most common ones in practice are
random sampling (sampling with or without replacement) and stratified sampling (sampling from
specific groups among the entire population). In addition, for imbalanced datasets, undersampling
and oversampling techniques are frequently used (SMOTE, for example). It is important to make sure
you understand the sampling method at hand because a wrong sampling strategy may lead to
incorrect results.

Step 9: Deployment
So, now that you've picked out a model, how do you deploy it? The process of operationalizing the
entire deployment process is referred to as "MLOps" (when DevOps meets ML). Two popular tools in
this space are Airflow and MLFlow. Because frameworks come and go, and many large tech
companies use their own internal versions of these tools, it's rare to be asked about these specific
technologies in interviews. However, knowledge of high-level deployment concepts is still helpful.
Generally, systems are deployed online, in batch, or as a hybrid of the two approaches. Online means
latency is critical, and thus model predictions are made in real time. Since model predictions
generally need to be served in real time, there will typically be a caching layer of precomputed features.
Downsides for online deployment are that it can be computationally intensive to meet latency
requirements and requires robust infrastructure monitoring and redundancy.
Batch means predictions are generated periodically and is helpful for cases where you don't need
immediate results (most recommendation systems, for example) or require high throughput. But the
downside is that batch predictions may not be available for new data (for example, a
recommendation list cannot be updated until the next batch is computed). Ideally, you can work
with stakeholders to find the sweet spot, where a batch predictor is updated frequently enough to
be "good enough" to solve the problem at hand.
One deployment issue worth bringing up, that's common to both batch and online ML systems, is
model degradation. Models degrade because the underlying distributions of data for your model
change. For a concrete example, suppose you were working for a clothing e-commerce site and were
training a product recommendation model in the wintertime. Come summer, you might accidentally
be recommending Canada Goose jackets in July, which is not a very relevant product suggestion to anyone
besides Drake.
This feature drift leads to the phenomenon known as training-serving skew, where there's a
performance hit between the model in training and evaluation time versus when the model is served
in production. To show awareness of the training-serving skew issue, be sure to mention to your
interviewer how you'd monitor the deployed model's performance and retrain it as the underlying data shifts.


7.27. Two Sigma: Describe the kernel trick in SVMs and give a simple example. How do you decide
what kernel to choose?
7.28. Morgan Stanley: Say we have N observations for some variable which we model as being drawn
from a Gaussian distribution. What are your best guesses for the parameters of the
distribution?
7.29. Stripe: Say we are using a Gaussian mixture model (GMM) for anomaly detection of fraudulent
transactions to classify incoming transactions into K classes. Describe the model setup
formulaically and how to evaluate the posterior probabilities and log likelihood. How can we
determine if a new transaction should be deemed fraudulent?
7.30. Robinhood: Walk me through how you'd build a model to predict whether a particular
Robinhood user will churn?
7.31. Two Sigma: Suppose you are running a linear regression and model the error terms as being
normally distributed. Show that in this setup, maximizing the likelihood of the data is equivalent
to minimizing the sum of the squared residuals.
7.32. Uber: Describe the idea behind Principal Component Analysis (PCA) and describe its
formulation and derivation in matrix form. Next, go through the procedural description and
solve the constrained maximization.
7.33. Citadel: Describe the model formulation behind logistic regression. How do you maximize the
log-likelihood of a given model (using the two-class case)?
7.34. Spotify: How would you approach creating a music recommendation algorithm for Discover
Weekly (a 30-song weekly playlist personalized to an individual user)?
7.35. Google: Derive the variance-covariance matrix of the least squares parameter estimates in
matrix form.

Machine Learning Solutions


Solution #7.1
Unbalanced classes can be dealt with in several ways.
First, you want to check whether you can get more data or not. While in many scenarios, data may
be expensive or difficult to acquire, it's important to not overlook this approach, and at least
mention it to your interviewer.
Next, make sure you're looking at appropriate performance metrics. For example, accuracy is not a
correct metric to use when classes are imbalanced — instead, you want to look at precision, recall,
F1 score, and the ROC curve.
Then, you can resample the training set by either oversampling the rare samples or undersampling
the abundant samples; both can be accomplished via bootstrapping. These approaches are easy and
quick to run, so they should be good starting points. Note: if the event is inherently rare, then
oversampling may not be necessary, and you should focus more on the evaluation function.
Additionally, you could generate synthetic examples. There are several algorithms for doing so —
the most popular is called SMOTE (synthetic minority oversampling technique), which creates
synthetic samples of the rare class rather than pure copies by selecting various instances. It does
this by modifying the attributes slightly by a random amount proportional to the difference in
neighboring instances.


Another way is to resample classes by running ensemble models with different ratios of the classes,
or by running an ensemble model using all samples of the rare class and a differing amount of the
abundant class. Note that some models, such as logistic regression, are able to handle unbalanced
classes relatively well in a standalone manner. You can also adjust the probability threshold to
something besides 0.5 for classifying the unbalanced outcome.
Lastly, you can design your own cost function that penalizes wrong classification of the rare class
more than wrong classifications of the abundant class. This is useful if you have to use a particular
kind of model and you're unable to resample. However, it can be complex to set up the penalty
matrix, especially with many classes.
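To make two of these ideas concrete, here is a hedged sketch of class weighting and SMOTE oversampling; it assumes the third-party imbalanced-learn package, and the data is a placeholder.

```python
# Class weighting and SMOTE sketch (data is a placeholder; SMOTE comes from imbalanced-learn).
import numpy as np
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = (rng.random(1000) < 0.05).astype(int)         # ~5% positive class: heavily imbalanced

# Option 1: penalize mistakes on the rare class more via class weights (a built-in cost tweak).
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# Option 2: oversample the rare class with synthetic examples (SMOTE).
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(np.bincount(y), "->", np.bincount(y_res))   # class counts before and after resampling
```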

Solution #7.2
We can denote squared error as MSE and absolute error as MAE. Both are measures of distances
between vectors and express the average model prediction error in units of the target variable. Both can range
from 0 to infinity; the lower the score, the better the model.
The main difference is that errors are squared before being averaged in MSE, meaning there is a
relatively high weight given to large errors. Therefore, MSE is useful when large errors in the model
are to be avoided. This means that outliers disproportionately affect MSE more than MAE —
meaning that MAE is more robust to outliers. Computation-wise, MSE is easier to use, since the
gradient calculation is more straightforward than that of MAE, which requires linear programming to
compute the gradient.
Therefore, if the model needs to be computationally easier to train or doesn't need to be robust to
outliers, then MSE should be used. Otherwise, MAE is the better option. Lastly, MSE corresponds to
maximizing the likelihood of Gaussian random variables, and MAE does not. MSE is minimized by the
conditional mean, whereas MAE is minimized by the conditional median.

Solution #7.3
The elbow method is the most well-known method for choosing k in k-means clustering. The
intuition behind this technique is that the first few clusters will explain a lot of the variation in the
data, but past a certain point, the amount of information added is diminishing. Looking at a graph of
explained variation (on the y-axis) versus the number of clusters (k), there should be a sharp change
in the y-axis at some level of k. For example, in the graph that follows, we see a drop off at
approximately k=6.
Note that the explained variation is quantified by the within-cluster sum of squared errors. To
calculate this error metric, we look at, for each cluster, the total sum of squared errors (using
Euclidean distance). A caveat to keep in mind: the assumption of a drop in variation may not
necessarily be true — the y-axis may be continuously decreasing slowly (i.e., there is no significant
drop).
Another popular alternative to determining k in k-means clustering is to apply the silhouette
method, which aims to measure how similar points are in its cluster compared to other clusters.
Concretely, it looks at:
\frac{x - y}{\max(x, y)}
where x is the mean distance to the examples of the nearest cluster, and y is the mean distance to
other examples in the same cluster. The coefficient varies between -1 and 1 for any given point. A

value of 1 implies that the point is in the "right" cluster, and vice versa for a score of -1. By plotting the
score on the y-axis versus k, we can get an idea for the optimal number of clusters based on this
metric. Note that the metric used in the silhouette method is more computationally intensive to
calculate for all points versus the elbow method.
[Figure: elbow method for optimal k, plotting explained variation against the number of clusters k]
Taking a step back, while both the elbow and silhouette methods serve their purpose, sometimes it
helps to lean on your business intuition when choosing the number of clusters. For example, if you
are clustering patients or customer groups, stakeholders and subject matter experts should have a
hunch concerning how many groups they expect to see in the data. Additionally, you can visualize
the features for the different groups and assess whether they are indeed behaving similarly. There is
no perfect method for picking k, because if there were, it would be a supervised problem and not an
unsupervised one.
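A short sketch of computing the quantities behind both methods with scikit-learn follows; the data and the range of k values are illustrative assumptions.

```python
# Elbow and silhouette computation sketch (data and k range are placeholders).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))

for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss = km.inertia_                        # within-cluster sum of squared errors (elbow plot's y-axis)
    sil = silhouette_score(X, km.labels_)     # mean silhouette coefficient across all points
    print(f"k={k}: WCSS={wcss:.1f}, silhouette={sil:.3f}")

# Plot WCSS against k and look for the "elbow," or pick the k with the highest silhouette score.
```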

Solution #7.4
Investigating outliers is often the first step in understanding how to treat them. Once you understand
the nature of why the outliers occurred, there are several possible methods we can use:
 Add regularization: reduces variance, for example L1 or L2 regularization.
 Try different models: can use a model that is more robust to outliers. For example, tree-based
models (random forests, gradient boosting) are generally less affected by outliers than linear
models.
 Winsorize data: cap the data at various arbitrary thresholds. For example, at a 90% winsorization,
we can take the top and bottom 5% of values and set them to the 95th and 5th percentile of
values, respectively.
 Transform data: for example, do a log transformation when the response variable follows an
exponential distribution or is right skewed.


 Change the error metric to be more robust: for example, for MSE, change it to MAE or Huber
loss.


 Remove outliers: only do this if you're certain that the outliers are true anomalies not worth
incorporating into the model. This should be the last consideration, since dropping data means
losing information on the variability in the data.

Solution #7.5
There will be two primary problems when running a regression if several of the predictor variables
are correlated. The first is that the coefficient estimates and signs will vary dramatically, depending
on what particular variables you included in the model. Certain coefficients may even have
confidence intervals that include 0 (meaning it is difficult to tell whether an increase in that X value is
associated with an increase or decrease in Y or not), and hence results will not be statistically
significant. The second is that the resulting p-values will be misleading. For instance, an important
variable might have a high p-value and so be deemed as statistically insignificant even though it is
actually important. It is as if the effect of the correlated features were "split" between them, leading
to uncertainty about which features are actually relevant to the model.
You can deal with this problem by either removing or combining the correlated predictors. To
effectively remove one of the predictors, it is best to understand the causes of the correlation (i.e.,
did you include extraneous predictors such as X and 2X, or are there some latent variables underlying
one or more of the ones you have included that affect both?). To combine predictors, it is possible to
include interaction terms (the product of the two that are correlated). Additionally, you could also (1)
center the data and (2) try to obtain a larger sample size, thereby giving you narrower confidence
intervals. Lastly, you can apply regularization methods (such as in ridge regression).

Solution #7.6
Random forests are used since individual decision trees are usually prone to overfitting. Not only can
these utilize multiple decision trees and then average their decisions, but they can be used for either
classification or regression. There are a few main ways in which they allow for stronger out-of-
sample prediction than do individual decision trees.
 As in other ensemble models, using a large set of trees created in a resample of the data
(bootstrap aggregation) will lead to a model yielding more consistent results. More specifically,
and in contrast to decision trees, it leads to diversity in training data for each tree and so
contributes to better results in terms of bias-variance trade-off (particularly with respect to
variance).
 Using only m < p features at each split helps to de-correlate the decision trees, thereby avoiding
having very important features always appearing at the first splits of the trees (which would
happen on standalone trees due to the nature of information gain).
 They're fairly easy to implement and fast to run.
 They can produce very interpretable feature-importance values, thereby improving model
understandability and feature selection.
The first two bullet points are the main ways random forests improve upon single decision trees.

Solution #7.7
Step l: Clarify the Missing Data
Since these types of problems are generally context dependent, it's best to start your answer with
clarifying questions. For example:


 Is the amount of missing values uniform by feature?
 Are these missing values numerical or categorical?
 How many features with missing data are there?
 Is there a pattern in the types of transactions that have a lot of missing data?

It would also be useful to think about why the data is missing, because this affects how you'd impute
the data. Missing data is commonly classified as:
 Missing completely at random (MCAR): the probability of being missing is the same for all
classes
 Missing at random (MAR): the probability of being missing is the same within groups defined by
the observed data
 Not missing at random (NMAR): if the data is not MCAR and not MAR

Step 2: Establish a Baseline


One reason to ask these questions is because a good answer would consider that the missing data
may not actually be a problem. What if the missing data was in transactions that were almost never
fraud? What if the missing data is mostly in columns whose data features don't have good predictive
value? For example, if you were missing the IP-address derived geolocation of the person making the
payment, that would likely be bad for model performance. On the other hand, if you were missing
the user's middle name, it likely wouldn't have any bearing on whether the transaction was fraud or
not. Even simpler yet, can a baseline model be built that meets the business goals, without having to
deal with any missing data?

Step 3: Impute Missing Data


If the baseline model indicates that dealing with the missing data is worth it, one technique we could
use is imputation. For continuous features, we can start by using the mean or median for missing
values within any feature. However, the downside to this approach is that it does not factor in any of
the other features and correlations between them — it is unlikely that two transactions in differing
locations for different category codes would have the same transaction price. An alternative could be
to use a nearest neighbors method to estimate any given feature based on the other features
available.

Step 4: Check Performance with Imputed Data


With these modeled features, we can then run a set of classification algorithms to predict fraud or
not fraud. With this imputation technique, we can also cross-validate to check whether performance
increases by including the imputed data, relative to just the original data. Note that a performance
increase is only expected if the feature contains valuable information (for those rows/entries that
have it). If this isn't the case, and you won't see a significant impact as a result, it may be easiest to
drop the existing missing data before training the model.

Step 5: Other Approaches for Missing Data


Finally, thinking outside the box, is it possible to use a third-party dataset to fill in some of the
missing information? Suppose a common missing piece of information was the type of business that
a person paid. But, let's say we have the business's address on file. Could we use the address against
a business listings dataset to infer the type of business the transaction was made at? Lastly, note that
some models are capable of dealing with missing data, without requiring imputation.


Solution #7.8
There are several possible ways to improve the performance of a logistic regression:
 Normalizing features: The features should be normalized such that particular weights do not
dominate within the model.
 Adding additional features: Depending on the problem, it may simply be the case that there
aren't enough useful features. In general, logistic regression is high bias, so adding more features
should be helpful.
 Addressing outliers: Identify and decide whether to retain or remove them.
 Selecting variables: See if any features have introduced too much noise into the process.
 Cross validation and hyperparameter tuning: Using k-fold cross validation along with
hyperparameter tuning (for example, introducing a penalty term for regularization purposes)
should help improve the model.
 The classes may not be linearly separable (logistic regression produces linear decision
boundaries), and, therefore, it would be worth looking into SVMs, tree-based approaches, or
neural networks instead.

Solution #7.9
For regular regression, recall we have the following for our least squares estimator:

\beta = (X^T X)^{-1} X^T y

So if we double the data, then we are using the stacked matrices

\begin{pmatrix} X \\ X \end{pmatrix} \quad \text{and} \quad \begin{pmatrix} y \\ y \end{pmatrix}

instead of X and y, respectively. Then plugging this into our estimator from above, we get:

\beta = \left( \begin{pmatrix} X \\ X \end{pmatrix}^{T} \begin{pmatrix} X \\ X \end{pmatrix} \right)^{-1} \begin{pmatrix} X \\ X \end{pmatrix}^{T} \begin{pmatrix} y \\ y \end{pmatrix}

Simplifying yields:

\beta = (2 X^T X)^{-1} \, 2 X^T y = (X^T X)^{-1} X^T y

Therefore, we see that the coefficient remains unchanged.

Solution #7.10
In both gradient boosting and random forests, an ensemble of decision trees is used. Additionally,
both are flexible models and don't need much data preprocessing.
However, there are two main differences. The first main difference is that, in gradient boosting, trees
are built one at a time, such that successive weak learners learn from the mistakes of preceding
weak learners. In random forests, the trees are built independently at the same time.
The second difference is in the output: gradient boosting combines the results of the weak learners
with each successive iteration, whereas, in random forests, the trees are combined at the end
(through either averaging or majority).
Because of these structural differences, gradient boosting is often more prone to overfitting than are
random forests due to their focus on mistakes over training iterations and the lack of independence
in tree building. Additionally, gradient boosting hyperparameters are harder to tune than those of random forests. Lastly, gradient boosting may take longer to train than random forests because its trees are built sequentially. In real-life applications, gradient boosting generally excels
when used on unbalanced datasets (fraud detection, for example), whereas random forests
generally excel at multi-class object detection with noisy data (computer vision, for example).
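A short, hedged comparison sketch on a synthetic imbalanced dataset (default hyperparameters only; relative performance will vary by problem):

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
for model in (GradientBoostingClassifier(), RandomForestClassifier()):
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(type(model).__name__, scores.mean())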

Solution #7.11
Because "accurate enough" is subjective, it's best to ask the interviewer clarifying questions before
addressing the lack of training data. To stand out, you can also proactively mention ways to source
more training data at the end of your answer.

Step 1: Clarify What "Good" ETA Means


To determine how accurate the ETA model needs to be, first ask what the ETA prediction will be used
for. The level of accuracy needed in ETA predictions might be higher for the order-driver matching
algorithm than what DoorDash needs to display to the customer in the app. Also, consider if your
ETA estimate under-promises and over-delivers. Maybe that's okay — customers would likely be
happy that the delivery arrived faster than expected. At the same time, high ETA estimates across
the board may lead people to say, "Screw it, I'll just go to the store and pick it up myself." By
considering the context around how the ETA predictions will be used, you'll be one step closer to
understanding what a good-enough ETA model looks like.
One data-driven approach to establishing how accurate your ETA model needs to be involves looking
at ETA models in similar markets. From data in other locations, we could better understand the
economic impact of both over and underestimated ETAs. Tying the model output to its business
impact can help DoorDash decide if investing money into solving this problem is even warranted in
the first place.

Step 2: Assess Baseline ETA Performance


After you understand what "good-enough" ETA means, it's best practice to next see how a baseline
model, trained on the 10,000 deliveries made during the beta, performs. A baseline model can be something as simple as the estimated driving time plus the average preparation time (conditional on the restaurant and time of day). Since predicting the ETAs is a regression problem, potential metrics we can use to assess this baseline model include root mean squared error (RMSE), mean absolute error (MAE), R², etc.

Step 3: Determine How More Data Improves Accuracy


Say that we use R² as the main metric. One way to gauge the benefit from having more training data would be to build learning curves. A learning curve depicts how the accuracy changes when we train a model on a progressively larger percentage of the data. For example, say that with 25% of the data, we get an R² of 0.5, with 50% of the data we get an R² of 0.65, and with 75% of the data we get an R² of 0.67. Note that the improvement drops off significantly between the use of 50% and 75% of the
training data. The point at which the drop-off starts to become a problem is the signal that we
should look into reevaluating the features rather than simply adding more training data.
This process is analogous to looking at the learning curves discussed earlier in this chapter, which
look at how the training error and validation error change over the number of iterations — here,
instead, we are looking at how the model performance changes over the amount of training data
used.
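A sketch of the learning-curve idea with scikit-learn; the data and model here are stand-ins for the actual delivery features:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import learning_curve

X, y = make_regression(n_samples=10_000, n_features=15, noise=20, random_state=0)
sizes, train_scores, val_scores = learning_curve(
    RandomForestRegressor(n_estimators=50, random_state=0),
    X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="r2",
)
for n, score in zip(sizes, val_scores.mean(axis=1)):
    print(n, round(score, 3))   # watch where the validation R² gains flatten out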

Step 4: In Case Performance Isn't "Good Enough"


If after learning curves you realize that you don't have sufficient data to build an accurate enough
model, the interviewer would likely shift the discussion to dealing with this lack of data. Or, if you are
feeling like an overachiever with a can-do attitude, you could proactively bring up these discussion
points:
 Are there too few features? If so, you want to look into adding features like marketplace supply
and demand indicators, traffic patterns on the road at the time of the delivery, etc.
 Are there too many features? If there are almost as many or more features than data points,
then our model will be prone to overfitting and we should look into either dimensionality
reduction or feature selection techniques.
 Can different models be used that handle smaller training datasets better?
 Is it possible to acquire more data in a cost-effective way?
 Is the less accurate ETA model a true launch blocker? If we launched in the new market, which
generates more training data, can the ETA model be retrained?

Solution #7.12
Without looking directly at feature weights, we could look at partial dependence plots (also called response curves) to assess how any one feature affects the model's decision. A partial dependence plot shows the marginal effect of a feature on the predicted target of a machine learning model. So, after the model is fit, we can take all the features and start plotting them individually against the predicted loan approval/rejection, while keeping all the other features constant.

For example, if we believe that FICO score has a strong relationship to the predicted probability of
loan rejection, then we can plot the loan approvals and rejections as we adjust the FICO score from
low to higher. Thus, we can get an idea of how features impact the model without explicitly looking
at feature weights, and supply reasons for rejection accordingly.
As a concrete example, consider having four applicants: 1, 2, 3, and 4. Assume that the features
include annual income, current debt, number of credit cards, and FICO score. Suppose we have the
following situation:
1. $100,000 income, $10,000 debt, 2 credit cards, and FICO score of 700.
2. $100,000 income, $10,000 debt, 2 credit cards, and FICO score of 720.
3. $100,000 income, $10,000 debt, 2 credit cards, and FICO score of 600.
4. $100,000 income, $10,000 debt, 2 credit cards, and FICO score of 650.


If 3 and 4 were rejected but 1 and 2 were accepted, then we can intuitively reason that a lower FICO
score was the reason the model made the rejections. This is because the remaining features are
equal, so the model chose to reject 3 and 4 "all-else-equal" versus 1 and 2.
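A sketch of how such partial dependence plots could be produced with scikit-learn; the feature names and the synthetic label construction below are purely hypothetical:

import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import PartialDependenceDisplay

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "annual_income": rng.normal(100_000, 20_000, 1_000),
    "current_debt": rng.normal(10_000, 3_000, 1_000),
    "fico_score": rng.integers(500, 850, 1_000),
})
# Hypothetical label: lower FICO scores are more likely to be rejected.
y = (X["fico_score"] + rng.normal(0, 30, 1_000) < 650).astype(int)

model = GradientBoostingClassifier().fit(X, y)
# Produces a matplotlib figure showing how predicted rejection varies with each feature.
PartialDependenceDisplay.from_estimator(model, X, features=["fico_score", "current_debt"])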

Solution #7.13
To find synonyms, we can first find word embeddings through a corpus of words. Word2vec is a
popular algorithm for doing so. It produces vectors for words based on the words' contexts. Vectors
that are closer in Euclidean distance are meant to represent words that are also closer in context and
meaning. The word embeddings that are thus generated are weights on the resulting vectors. The
distance between these vectors can be used to measure similarity, for example, via cosine similarity
or some other similar measure.
Once we have these word embeddings, we can then run an algorithm such as K-means clustering to
identify clusters within word embeddings or run a K-nearest neighbor algorithm to find a particular
word for which we want to find synonyms. However, some edge cases exist, since word2vec can
produce similar vectors even in the case of antonyms; consider the words "hot" and "cold," for
example, which have opposite meanings but appear in many similar contexts (related to
temperature or in a Katy Perry song).
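A hedged sketch using gensim's word2vec implementation on a toy corpus (a real system would train on a large corpus, so the output here is only illustrative):

from gensim.models import Word2Vec

sentences = [
    ["the", "coffee", "is", "hot"],
    ["the", "tea", "is", "warm"],
    ["the", "soda", "is", "cold"],
    ["the", "juice", "is", "chilled"],
]
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=200)
print(model.wv.most_similar("hot", topn=3))   # nearest words by cosine similarity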

Solution #7.14
The bias-variance trade-off is expressed as the following: Total model error = Bias² + Variance +
Irreducible error. Flexible models tend to have low bias and high variance, whereas more rigid
models have high bias and low variance. The bias term comes from the error that occurs when a
model underfits data. Having a high bias means that the machine learning model is too simple and
may not adequately capture the relationship between the features and the target. An example
would be using linear regression when the underlying relationship is nonlinear.
The variance term represents the error that occurs when a model overfits data. Having a high
variance means that a model is susceptible to changes in training data, meaning that it is capturing
and so reacting to too much noise. An example would be using a very complex neural network when
the true underlying relationship between the features and the target is simply a linear one.
The irreducible term is the error that cannot be addressed directly by the model, such as from noise
in data measurement.
When creating a machine learning model, we want both bias and variance to be low, because we
want to be able to have a model that predicts well but that also doesn't change much when it is fed
new data. The key point here is to prevent overfitting and, at the same time, to attempt to retain
sufficient accuracy.

Solution #7.15
Cross validation is a technique used to assess the performance of an algorithm in several resamples/
subsamples of training data. It consists of running the algorithm on subsamples of the training data,
such as the original data less some of the observations comprising the training data, and evaluating
model performance on the portion of the data that was not present in the subsample. This process is
repeated many times for the subsamples, and the results are combined at the end. This step is very
important in ML because it reveals the quality and consistency of the model's true performance.


The process is as follows:


1. Randomly shuffle the data into 𝓀 equally sized blocks (folds).
2. For each fold i in 1...𝓀, train the model on all the data except fold i, and evaluate the validation error using fold i.
3. Average the 𝓀 validation errors from step 2 to get an estimate of the true error.

This process aids in accomplishing the following: (1) avoiding training and testing on the same subsets of data points, which would lead to overfitting, and (2) avoiding using a dedicated validation set, with which no training can be done. The second of these points is particularly important in cases where very little training data is available or the data collection process is expensive. One drawback of this process, however, is that roughly 𝓀 times more computation is needed than using a dedicated holdout validation set. In practice, cross validation works very well for smaller datasets.
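A minimal k-fold cross validation sketch with scikit-learn:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X, y, cv=5)
print(scores, scores.mean())   # one validation score per fold, then the average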

Solution #7.16
Step 1: Clarify Lead Scoring Requirements
Lead scoring is the process of assigning numerical scores for any leads (potential customers) in a
business. Lead scores can be based on a variety of attributes, and help sales and marketing teams
prioritize leads to try and convert them to customers.
As always, it's smart to ask the interviewer clarifying questions. In this case, learning more about the
requirements for the lead scoring algorithm is critical. Questions to ask include:
 Are we building this for our own company's sales leads? Or, are we building an extensible version
as part of the Salesforce product?
 Are there any business requirements behind the lead scoring (i.e., does it need to be easy to
explain internally and/or externally)?
 Are we running this algorithm only on companies in our sales database (CRM), or looking at a
larger landscape of all companies?
For our solution, we'll assume the interviewer means we want to develop a lead scoring model to
be used internally — that means using the company's internal sales data to predict whether a
prospective company will purchase a Salesforce product.

Step 2: Explain the Features You'd Use


Some elements which should influence whether a prospective company turns into a customer:
 Firmographic Data: What type of company is this? Industry? Amount of revenue? Employee
count?
 Marketing Activity: Have they interacted with marketing materials, like clicking on links within
email marketing campaigns? Have employees from that company downloaded whitepapers, read
case studies, or clicked on ads? If so, how much activity has there been recently?
 Sales Activity: Has the prospective company interacted with sales? How many sales meetings
took place, and how recently did the last one take place?
 Deal Details: What products are being bought? Some might be harder to close than others. How
many seats (licenses) are being bought? What's the size of the deal? What's the contract length?


After selecting features, it is good to conduct the standard set of feature engineering best practices.
Note that the model will only be as good as the data and judgement in feature engineering applied
— in practice, many companies that predict lead scoring can face issues with missing data or lack of
relevant data.

Step 3: Explain Models You'd Use


Lead scoring can be done through building a binary classifier that predicts the probability of a lead
converting. In terms of model selection, logistic regression offers a straightforward solution with an
easily interpretable result: the resulting log-odds is a probability score for, in this case, purchasing a
particular item. However, it cannot capture complex interaction effects between different variables
and could also be numerically unstable under certain conditions (i.e., correlated covariates and a
relatively small user base).
An alternative to logistic regression would be to use a more complex model, such as a neural
network or an SVM. Both are great for dealing with high-dimensional data and with capturing the
complex interactions that logistic regression cannot. However, unlike logistic regression, neither is
easy to explain, and both generally require a large amount of data to perform well.
A suitable compromise is tree-based models, such as random forests or XGBoost, which typically
perform well. With tree-based models, the features that have the highest influence on predictions
are readily perceived, a characteristic that could be very useful in this particular case.
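A rough sketch of the modeling step under the assumptions above; every column name and the synthetic label construction below are hypothetical stand-ins for real CRM data:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5_000
leads = pd.DataFrame({
    "employee_count": rng.lognormal(4, 1, n).astype(int),
    "email_clicks_90d": rng.poisson(3, n),
    "sales_meetings_90d": rng.poisson(1, n),
    "seats_requested": rng.integers(1, 500, n),
})
# Hypothetical label: engagement-driven conversion probability.
p = 1 / (1 + np.exp(-(0.3 * leads["sales_meetings_90d"] + 0.1 * leads["email_clicks_90d"] - 2)))
leads["converted"] = rng.random(n) < p

X_train, X_test, y_train, y_test = train_test_split(
    leads.drop(columns="converted"), leads["converted"], test_size=0.2, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
lead_scores = model.predict_proba(X_test)[:, 1]          # probability of converting
print(dict(zip(X_train.columns, model.feature_importances_.round(3))))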

Step 4: Model Deployment Nuances


Lastly, it is important to monitor for feature shifts and/or model degradations. As the product line
and customer base changes over time, models trained on old data may not be as relevant. For a
mature company like Salesforce, for example, it's very likely that companies signing up now aren't
exactly like the companies that signed up with Salesforce 5 or 10 years ago. That's why it's important
to monitor feature drift and continuously update the model.

Solution #7.17
Collaborative filtering would be a commonly used method for creating a music recommendation
algorithm. Such algorithms use data on what feedback users have provided on certain items (songs
in this case) in order to decide recommendations. For example, a well-known use case is for movie
recommendation on Netflix. However, there are several differences compared to the Netflix case:
 Feedback for music does not have a 1-to-5 rating scale as Netflix does for its movies.
 Music may be subject to repeated consumption; that is, people may watch a movie once or
twice but will listen to a song many times over.
 Music has a wider variety (i.e., niche music).
 The scale of music catalog items is much larger than movies (i.e., there are many more songs
than movies).
Therefore, a user-song matrix (or a user-artist matrix) would constitute the data for this issue, with
the rows of the dataset being users and the columns various songs. However, in considering the first
point above, since explicit ratings are lacking, we can instead count the number of times each song is streamed and store this implicit-feedback count.
We can then proceed with matrix factorization. Say there are M songs and N users in the matrix, which we will label R. Then, we want to solve

R \approx P Q^T

where each row p_u of P is a latent user vector capturing user u's preferences and each row q_i of Q is a latent song vector, so that the predicted relevance of song i to user u is the dot product q_i^T p_u.


Various methods can be used for this matrix factorization; a common one is alternating least squares
(ALS), and, since the scale of the data is large, this would likely be done through distributed
computing. Once the latent user and song vectors are discovered, then the above dot product will be
able to predict the relevance of a particular song to a user. This process can be used directly for
recommendation at the user level, where we sort by relevance prediction on songs that the user has
not yet streamed. In addition, the vectors given above can be employed in such tasks as assessing
similarity between different users and different songs using a method such as kNN (K-nearest
neighbors).
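A small sketch of the factorization and recommendation steps on a synthetic play-count matrix; NMF is used here only as a simple stand-in for a distributed ALS implementation:

import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
R = rng.poisson(1.0, size=(100, 500)).astype(float)   # users x songs play counts

nmf = NMF(n_components=20, init="nndsvda", max_iter=500)
P = nmf.fit_transform(R)      # latent user vectors (100 x 20)
Q = nmf.components_.T         # latent song vectors (500 x 20)

relevance = P @ Q.T           # predicted relevance of every song for every user
user0_unheard = np.where(R[0] == 0)[0]
top_recs = user0_unheard[np.argsort(-relevance[0, user0_unheard])[:10]]
print(top_recs)               # top-10 recommendations for user 0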

Solution #7.18
Mathematically, a convex function f satisfies the following for any two points x and y in the domain of f:

f((1 - a)x + ay) \le (1 - a) f(x) + a f(y), \qquad 0 \le a \le 1

That is, the line segment connecting (x, f(x)) and (y, f(y)) lies on or above the graph of f for any points x and y. Convexity matters because it has implications about the nature of minima in f. Stated more specifically, any local minimum of f is also a global minimum.
Neural networks provide a significant example of non-convex problems in machine learning. At a high level, this is because neural networks are universal function approximators, meaning that they can (with a sufficient number of neurons) approximate any function arbitrarily well. Because not all functions are convex (and convex functions cannot approximate non-convex ones), neural networks must, by definition, be non-convex. In particular, the cost function for a neural network has a number of local minima: you could interchange the parameters of different nodes in various layers and still obtain exactly the same cost function output (all inputs/outputs the same, but with nodes swapped). Therefore, there is no single, well-defined global minimum, and the neural network's cost function cannot be convex.

Solution #7.19
Because information gain is based on entropy, we'll discuss entropy first.

The formula for entropy is

Entropy = \sum_{k=1}^{K} -P(Y = k) \log_2 P(Y = k)

The equation above yields the amount of entropy present and shows exactly how homogeneous a sample is (based on the attribute being split). Consider a case where k = 2. Let a and b be two outputs/labels that we are trying to classify. Given these values, the formula considers the proportion of values in the sample that are a and the proportion that are b, with the sample being split on a different attribute.
A completely homogeneous sample will have an entropy of 0. For instance, if a given split contains only values of a (and no b's), then its entropy would be

Entropy = -1 \cdot \log_2(1) - 0 \cdot \log_2(0) = 0

(using the convention that 0 \cdot \log_2(0) = 0), whereas a completely even split (50%-50%) would result in an entropy of 1. A lower entropy means a more homogeneous sample.
Information gain is based on the decrease in entropy after splitting on an attribute:

IG(X_j, Y) = H(Y) - H(Y \mid X_j)

This concept is better explained with a simple numerical example. Consider the above case again with k = 2. Let's say there are 5 instances of value a and 5 instances of value b. Then, we decide to split on some attribute X. When X = 1, there are 5 a's and 1 b, whereas when X = 0, there are 4 b's and 0 a's.


Now, by splitting on X, we have two subsets: X = 1 and X = 0.


Entropy(After) = (4/10) \cdot Entropy(X = 0) + (6/10) \cdot Entropy(X = 1)

The entropy value for X = 0 is 0, since the sample is homogeneous (all b's, no a's):

Entropy(X = 0) = 0

Entropy(X = 1) = -(1/6) \log_2(1/6) - (5/6) \log_2(5/6) = 0.65

Plugging these into the Entropy(After) formula, we obtain the following:

Entropy(After) = (4/10) \cdot 0 + (6/10) \cdot 0.65 = 0.39

Finally, we can go back to our original formula and obtain the information gain: IG = 1 - 0.39 = 0.61
These results make intuitive sense, since, ideally, we want to split on an attribute that splits the
output perfectly. Therefore, we ideally want to split on something that is homogeneous with regards
to the output, and this something would thus have an entropy equal to 0.
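The worked example above can be reproduced numerically:

import numpy as np

def entropy(counts):
    p = np.array(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return -(p * np.log2(p)).sum()

h_before = entropy([5, 5])                                         # 5 a's and 5 b's -> 1.0
h_after = (4 / 10) * entropy([0, 4]) + (6 / 10) * entropy([1, 5])
print(h_before, round(h_after, 2), round(h_before - h_after, 2))   # 1.0 0.39 0.61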

Solution #7.20
In machine learning, L1 and L2 penalization are both regularization methods that prevent overfitting
by coercing the coefficients of a regression model towards zero. The difference between the two
methods is the form of the penalization applied to the loss function. For a regular regression model,
assume the loss function is given by L. Using L1 regularization, the least absolute shrinkage and
selection operator, or Lasso, adds the absolute value of the coefficients as a penalty term, whereas
ridge regression uses L2 regularization, that is, adding the squared magnitude of the coefficients as
the penalty term.
The loss functions for the two are thus the following:

Loss(L1) = L + \lambda \sum_{i} |w_i|

Loss(L2) = L + \lambda \sum_{i} w_i^2

where \lambda controls the strength of the penalty, and where the loss function L is the sum of squared errors, given by the following, in which f(x) is the model of interest — for example, a linear regression with p predictors:

L = \sum_{i=1}^{n} \left( y_i - f(x_i) \right)^2 = \sum_{i=1}^{n} \left( y_i - \sum_{j=1}^{p} x_{ij} w_j \right)^2 \quad \text{(for linear regression)}

If we run gradient descent on the weights w, L1 regularization forces any weight closer to 0, irrespective of its magnitude, whereas with L2 regularization, the rate at which a weight approaches 0 becomes slower as the weight approaches 0. Because of this, L1 is more likely to "zero out" particular weights and hence completely remove certain features from the model, leading to models of increased sparseness.
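A quick sketch showing that sparsity difference in practice (synthetic data, arbitrary penalty strength):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10, random_state=0)
lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)
print("zero coefficients (L1):", np.sum(lasso.coef_ == 0))   # typically many exact zeros
print("zero coefficients (L2):", np.sum(ridge.coef_ == 0))   # typically none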

Solution #7.21
The gradient descent algorithm takes small steps in the direction of steepest descent to optimize a
particular objective function. The size of the "steps" the algorithm takes are proportional to the
negative gradient of the function at the current value of the parameter being sought. The stochastic
version of the algorithm, SGD, uses an approximation of the nonstochastic gradient descent
algorithm instead of the function's actual gradient. This estimate is done by using only one randomly
selected sample at each step to evaluate the derivative of the function, making this version of the
algorithm much faster and more attractive for situations involving lots of data. SGD is also useful
when redundancy in the data is present (i.e., observations that are very similar).
Assume we are minimizing a function f, with parameter value x_t at iteration t. Then, the gradient descent algorithm will update x as follows until it reaches convergence:

x_{t+1} = x_t - \alpha_t \nabla f(x_t)
That is, we calculate the negative of the gradient of f and scale that by some constant and move in
that direction at the end of each iteration.
Since many loss functions are decomposable into the sum of individual functions, then the gradient
step can be broken down into addition of discrete, separate gradients. However, for very large
datasets, this process can be computationally intensive, and the algorithm might become stuck at
local minima or at saddle points.
Therefore, we use the stochastic gradient descent algorithm to obtain an unbiased estimate of the
true gradient without going through all data points by uniformly selecting a point at random and
performing a gradient update then and there.
The estimate is therefore unbiased, since the full gradient decomposes over the data points:

\nabla f(x) = \frac{1}{n} \sum_{i=1}^{n} \nabla f_i(x)

Since the data are assumed to be i.i.d. and the point is sampled uniformly, the expectation of the stochastic gradient g(x) (the gradient of a single randomly chosen f_i) satisfies

E[g(x)] = \nabla f(x)
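A minimal SGD sketch for least squares, sampling one point uniformly at random per update (the constant step size below is an arbitrary illustrative choice):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=1_000)

w, lr = np.zeros(3), 0.01
for step in range(20_000):
    i = rng.integers(len(X))                    # uniformly sample one point
    grad = 2 * (X[i] @ w - y[i]) * X[i]         # unbiased estimate of the full gradient
    w -= lr * grad                              # move against the gradient
print(w)   # close to [2.0, -1.0, 0.5]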

Solution #7.22
Recall that the ROC curve plots the true positive rate versus the false positive rate. If all scores
change simultaneously, then none of the actual classifications change (since thresholds are
adjusted): leading to the same true positive and false positive rates, since only the relative ordering
of the scores matters. Therefore, taking a square root would not cause any change to the ROC curve
because the relative ordering has been maintained. If one application had a score of X and another a
score of Y, and if Y > X, then √ Y > √ X still. Only the model thresholds would change.
In contrast, any function that is not monotonically increasing would change the ROC curve, since the
relative ordering would not be maintained. Some simple examples are the following:

f(x) = -x,   f(x) = -x^2,   or a stepwise function.
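A quick numerical check with simulated scores:

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 1_000)
scores = rng.random(1_000) * (0.4 + 0.4 * y)      # scores loosely correlated with the label
print(roc_auc_score(y, scores), roc_auc_score(y, np.sqrt(scores)))   # identical AUC
print(roc_auc_score(y, scores), roc_auc_score(y, -scores))           # flipped: 1 - AUC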

Solution #7.23
We have X \sim N(\mu, \sigma^2), and entropy for a continuous random variable is given by the following:

H(x) = -\int_{-\infty}^{\infty} p(x) \log p(x) \, dx

For a Gaussian, we have the following:

p(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}

Substituting into the above equation yields

H(x) = -\int_{-\infty}^{\infty} p(x) \log\left( \frac{1}{\sigma \sqrt{2\pi}} \right) dx - \int_{-\infty}^{\infty} p(x) \left( -\frac{(x-\mu)^2}{2\sigma^2} \right) \log(e) \, dx

where the first term equals

\log(\sigma \sqrt{2\pi}) \int_{-\infty}^{\infty} p(x) \, dx = \log(\sigma \sqrt{2\pi})

since the integral evaluates to 1 (by the definition of a probability density function). Working in natural logarithms (so that \log e = 1), the second term is given by

\frac{1}{2\sigma^2} \int_{-\infty}^{\infty} p(x) (x-\mu)^2 \, dx = \frac{\sigma^2}{2\sigma^2} = \frac{1}{2}

since the inner integral is the expression for the variance. The entropy is therefore as follows:

H(x) = \frac{1}{2} + \log(\sigma \sqrt{2\pi})
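A quick check of the closed form against scipy (both in nats):

import numpy as np
from scipy.stats import norm

sigma = 2.5
print(norm(loc=0, scale=sigma).entropy())          # differential entropy of the Gaussian
print(0.5 + np.log(sigma * np.sqrt(2 * np.pi)))    # matches the derived formula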
Solution #7.24
The standard approach is (1) to construct a large dataset with the variable of interest (purchase or
not) and relevant covariates (age, gender, income, etc.) for a sample of platform users and (2) to
build a model to calculate the probability of purchase of each item. Propensity models are a form of
binary classifier, so any model that can accomplish this could be used to estimate a customer's
propensity to buy the product.
In selecting a model, logistic regression offers a straightforward solution with an easily interpretable
result: the resulting log-odds is a probability score for, in this case, purchasing a particular item.
However, it cannot capture complex interaction effects between different variables and could also be
numerically unstable under certain conditions (i.e., correlated covariates and a relatively small user
base).
An alternative to logistic regression would be to use a more complex model, such as a neural
network or an SVM. Both are great at dealing with high-dimensional data and at capturing the
complex interactions that logistic regression cannot. However, unlike logistic regression, neither is
easy to explain, and both generally require a large amount of data to perform well.
A good compromise is tree-based models, such as random forests, which are typically highly
accurate and are easily understandable. With tree-based models, the features which have the
highest influence on predictions are readily perceived, a characteristic that could be very useful in
this particular case.

Solution #7.25
Both Gaussian naive Bayes (GNB) and logistic regression can be used for classification. The two
models each have advantages and disadvantages, which provide the answer as to which to choose
under what circumstances. These are discussed below, along with their similarities and differences:
Advantages:
1. GNB requires only a small number of observations to be adequately trained; it is also easy to use
and reasonably fast to implement; interpretation of the results produced by GNB can also be
highly useful.
2. Logistic regression has a simple interpretation in terms of class probabilities, and it allows
inferences to be made about features (i.e., variables) and identification of the most relevant of
these with respect to prediction.
Disadvantages:


1. By assuming features (i.e., variables) to be independent, GNB can be wrongly employed in problems where that does not hold true, a very common occurrence.
2. Not highly flexible, logistic regression may fail to capture interactions between features and so
may lose prediction power. This lack of flexibility can also lead to overfitting if very little data are
available for training.
Differences:
1. Since logistic regression directly learns P(Y|X), it is a discriminative classifier, whereas GNB directly estimates P(Y) and P(X|Y) and so is a generative classifier.


2. Logistic regression requires an optimization setup (where weights cannot be learned directly
through counts), whereas GNB requires no such setup.
Similarities:
1. Both methods are linear decision functions generated from training data.
2. GNB's implied P(Y|X) has the same form as that of logistic regression (but with particular parameters).
Given these advantages and disadvantages, logistic regression would be preferable assuming training data size is not an issue, since GNB's assumption of conditional independence breaks down if features are correlated. However, in cases where training data are limited or the data-generating process includes strong priors, using GNB may be preferable.

Solution #7.26
Assume we have 𝓀 clusters, n sample points x_1, ..., x_n, and cluster centroids \mu_1, ..., \mu_k.
The loss function then consists of minimizing the total error, using a squared L2 norm (since it is a good way to measure distance), for all points within a given cluster:

L = \sum_{j=1}^{k} \sum_{x_i \in S_j} \| x_i - \mu_j \|^2

Taking the derivative with respect to a centroid \mu_k yields the following:

\frac{\partial L}{\partial \mu_k} = \frac{\partial}{\partial \mu_k} \sum_{x_i \in S_k} (x_i - \mu_k)^T (x_i - \mu_k) = -\sum_{x_i \in S_k} 2 (x_i - \mu_k)

For batch gradient descent, the update step (moving against the gradient) is then given by the following:

\mu_k = \mu_k + \epsilon \sum_{x_i \in S_k} 2 (x_i - \mu_k)

However, for stochastic gradient descent, using a single sampled point x_t, the update step is given by the following:

\mu_k = \mu_k + \epsilon (x_t - \mu_k)

Solution #7.27
The idea behind the kernel trick is that data that cannot be separated by a hyperplane in its current dimensionality can actually be linearly separable by projecting it onto a higher-dimensional space. A kernel computes

k(x, y) = \phi(x)^T \phi(y)

and we can take any data and map that data to a higher dimension through a variety of mappings \phi. However, if \phi is difficult to compute, then we have a problem — instead, it is desirable if we can compute the value of k without blowing up the computation.
For instance, say we have two-dimensional examples and want to map them to a quadratic feature space. We can use the following mapping:

\phi(x_1, x_2) = \left( 1, \; x_1^2, \; x_2^2, \; \sqrt{2} x_1, \; \sqrt{2} x_2, \; \sqrt{2} x_1 x_2 \right)^T

and we can verify that:

k(x, y) = (1 + x^T y)^2 = \phi(x)^T \phi(y)

If we now change the degree from 2 (quadratic) to an arbitrary degree, we can have an arbitrarily complex \phi. As long as we perform computations in the original feature space (without an explicit feature transformation), then we avoid the long compute time while still mapping our data to a higher dimension!
In terms of which kernel to choose, we can choose between linear and nonlinear kernels, and these
will be for linear and nonlinear problems, respectively. For linear problems, we can try a linear or
logistic kernel. For nonlinear problems, we can try either radial basis function (RBF) or Gaussian
kernels.
In real-life problems, domain knowledge can be handy — in the absence of such knowledge, the
above defaults are probably good starting points. We could also try many kernels, and set up a
hyperparameter search (a grid search, for example) and compare different kernels to one another.
Based on the loss function at hand, or certain performance metrics (accuracy, F1, AUC of the ROC
curve, etc.), we can determine which kernel is appropriate.
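A sketch of that kernel comparison via grid search on a toy nonlinear dataset:

from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.2, random_state=0)   # not linearly separable
grid = GridSearchCV(
    make_pipeline(StandardScaler(), SVC()),
    {"svc__kernel": ["linear", "rbf", "poly"], "svc__C": [0.1, 1, 10]},
    cv=5, scoring="accuracy",
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))   # the RBF kernel typically wins here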

Solution #7.28
Assume we have some dataset X consisting of n i.i.d. observations x_1, ..., x_n.
Our likelihood function is then

p(X \mid \mu, \sigma^2) = \prod_{i=1}^{n} N(x_i \mid \mu, \sigma^2), \quad \text{where} \quad N(x_i \mid \mu, \sigma^2) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x_i - \mu)^2}{2\sigma^2}}

and therefore the log-likelihood is given by:

\log p(X \mid \mu, \sigma^2) = \sum_{i=1}^{n} \log N(x_i \mid \mu, \sigma^2) = -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2 - \frac{n}{2} \log \sigma^2 - \frac{n}{2} \log 2\pi

Taking the derivative of the log-likelihood with respect to \mu and setting the result to 0 yields the following:

\frac{d \log p(X \mid \mu, \sigma^2)}{d\mu} = \frac{1}{\sigma^2} \sum_{i=1}^{n} (x_i - \mu) = 0

Simplifying the result yields \sum_{i=1}^{n} x_i = n\hat{\mu}, and therefore the maximum likelihood estimate for \mu is given by:

\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i

To obtain the variance, we take the derivative of the log-likelihood with respect to \sigma^2 and set the result equal to 0:

\frac{d \log p(X \mid \mu, \sigma^2)}{d\sigma^2} = \frac{1}{2\sigma^4} \sum_{i=1}^{n} (x_i - \mu)^2 - \frac{n}{2\sigma^2} = 0

Simplifying yields the following:

\sum_{i=1}^{n} (x_i - \mu)^2 = n\sigma^2

The maximum likelihood estimate for the variance is thus given by the following:

\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})^2
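A quick numerical check that these estimators are just the sample mean and the divide-by-n sample variance:

import numpy as np
from scipy.stats import norm

x = np.random.default_rng(0).normal(loc=3.0, scale=2.0, size=100_000)
mu_hat, var_hat = x.mean(), x.var(ddof=0)       # ddof=0 -> divide by n
print(mu_hat, var_hat)                          # close to 3.0 and 4.0
print(norm.fit(x))                              # scipy's MLE: (mu_hat, sqrt(var_hat))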

Solution #7.29
The GMM model assumes that the data come from a mixture of Gaussian probability density functions across K classes:

p(x) = \sum_{k=1}^{K} \pi_k N(x \mid \mu_k, \Sigma_k)

where the coefficients \pi_k are the mixing coefficients on the clusters and are normalized so that they sum to 1.
The posterior probability for each cluster is given by Bayes' rule and can be interpreted as "the probability of being in class k given the data x":

z_k = p(k \mid x) = \frac{p(k) \, p(x \mid k)}{p(x)}

and hence:

z_k = p(k \mid x) = \frac{\pi_k N(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j N(x \mid \mu_j, \Sigma_j)}

The unknown set of parameters \theta consists of the mean and covariance parameters for each of the K classes, along with the K mixing coefficients. The likelihood is therefore given by:

p(X \mid \theta) = \prod_{i=1}^{n} p(x_i) = \prod_{i=1}^{n} \sum_{k=1}^{K} \pi_k N(x_i \mid \mu_k, \Sigma_k)

and therefore the log-likelihood is

\log p(X \mid \theta) = \sum_{i=1}^{n} \log \sum_{k=1}^{K} \pi_k N(x_i \mid \mu_k, \Sigma_k)

The parameters can be estimated iteratively using expectation-maximization and the expressions above. After the model has been trained, for any new transaction we can then calculate its posterior probabilities over the K classes as above. If the posterior probabilities calculated are low — or, more directly, if the likelihood p(x) of the transaction under the fitted mixture is low — then the transaction most likely does not belong to any of the K classes, so we can deem it to be fraudulent.
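A sketch of how this could look with scikit-learn's GaussianMixture on simulated transactions; the two-cluster data and the 1st-percentile threshold below are arbitrary choices for illustration:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
normal_txns = np.vstack([rng.normal([50, 1], [10, 0.5], size=(500, 2)),
                         rng.normal([500, 5], [50, 1.0], size=(500, 2))])
gmm = GaussianMixture(n_components=2, random_state=0).fit(normal_txns)

new_txns = np.array([[55, 1.2], [5_000, 40]])             # the second one looks anomalous
log_likelihood = gmm.score_samples(new_txns)              # log p(x) under the mixture
threshold = np.percentile(gmm.score_samples(normal_txns), 1)
print(log_likelihood < threshold)                         # [False  True]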

Solution #7.30
Step 1: Clarify What Churn Is & Why It's Important
First, it is important to clarify with your interviewer what churn means. Generally, the word "churn"
defines the process of a platform's loss of users over time.
To determine what qualifies as a churned user at Robinhood, it's helpful to first follow the money
and understand how Robinhood monetizes. One primary way is by trading activity — whether it is
through their Robinhood Gold offering or order flow sold to market makers like Citadel. Thus, a
cancellation of their Robinhood Gold membership or a long period of no trading activity could
constitute churn. The other way Robinhood monetizes is through a user's account balance. By
collecting interest on uninvested cash and making stock loans to counterparties, Robinhood is
incentivized to have users manage a large portfolio on the platform. As such, a negligible account
balance or portfolio maintained over a period of time — say a quarter — could constitute a churned
user.
Churn is a big deal, because even a small monthly churn can compound quickly over time: consider
that a 2% monthly churn translates to losing roughly 21% of users over a year. Since it is much more expensive to
acquire new customers than to retain existing ones, businesses with high churn rates will need to
continually dedicate more financial resources to support new customer acquisition, which is
costly, and therefore to be avoided if possible. So, if Robinhood is to stay ahead of WeBull, Coinbase, and TD
Ameritrade: predicting who will churn, and then helping these at-risk users, is beneficial.
After you've worked with your interviewer to clarify what churn is in this context, and why it's
important to mitigate, be sure to ask the obvious question: how is my model output going to be
used? If it’s not clear how the model will be used for the business, then even if the model has great
predictive power, it is not useful in practice.

Step 2: Modeling Considerations


Any classification algorithm could be used to model whether a particular customer would be in the
churned or active state. However, models that produce probabilities (e.g., logistic regression) would
be preferable if the interest is in the probability of the customer's loss rather than simply a final
prediction about whether the customer will be lost or not.
Another key consideration when picking a model in this instance would be model explainability. This
is because company representatives likely want to understand the main reasons for churn to support
marketing campaigns or customer support programs. In this case, interpretable models such as
logistic regression, decision trees, or random forests should be used. However, if by talking with the
interviewer you learn that it's okay to simply detect churn, and that explainability isn't required, then
less interpretable models like neural networks and SVMs can work.

Step 3: Features We'd Use to Model Churn


Some feature ideas include:
 Raw Account Balance: Is the portfolio value close to the threshold where it doesn't make sense to check the app anymore (say, under $10)?
 Account Balance Trend: Is there a negative trend — they used to have $20k in their account, but have steadily been withdrawing money out of the account?
 Experienced Heavy Losses: Maybe they recently lost a ton of money trading Dogecoin, making them want to quit investing and rethink their trading strategies (and their life).
 Recent Usage Patterns: Maybe they used to be a Daily Active User, but recently have started logging in less and less — a sign that the app isn't as important any more.
 User Demographics: Standard user profile variables like age, gender, and location can also be used to model churn.
It’s also wise to collaborate with business users to see their perspectives and to look for basic
heuristics they might use that can be factored into the model. For example, maybe the customer
support team has some insights into signals that indicate a user will churn out.
After running the model, it is good to double-check the results to see if the feature importance
roughly matches what we would intuitively expect; for example, it is unlikely that a higher balance
would result in a higher likelihood of churn.

Step 4: Deploying the Churn Model


We want to make sure the various metrics of interest (confusion matrix, ROC curve, F1 scores, etc.)
are satisfactory during offline training before deploying the model in production. As with any
prediction task, it is important to monitor model performance and adjust features as necessary
whenever there is new data or feedback from customer-facing teams. This helps prevent model
degradation, which is a common problem in real-world ML systems. We'd also continuously conduct
error analysis by looking at where the model is wrong, in order to keep refining the model. Finally,
we'd also make sure to A/B test our model to validate its impact.


Solution #7.31
In matrix form, we assume Y is distributed as multivariate Gaussian: Y \sim N(X\beta, \sigma^2 I).
The likelihood of Y given above is

L(\beta, \sigma^2) = (2\pi\sigma^2)^{-n/2} \exp\left( -\frac{1}{2\sigma^2} (X\beta - Y)^T (X\beta - Y) \right)

of which we can take the log in order to optimize:

\log L(\beta, \sigma^2) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} (X\beta - Y)^T (X\beta - Y)

Note that, when taking a derivative with respect to \beta, the first term is a constant, so we can ignore it, making our optimization problem the following:

\arg\max_{\beta} \; -\frac{1}{2\sigma^2} (X\beta - Y)^T (X\beta - Y)

We can drop the constant factor and flip the sign to rewrite this as:

\arg\min_{\beta} \; (X\beta - Y)^T (X\beta - Y)

which is exactly equivalent to minimizing the sum of the squared residuals.

Solution #7.32
PCA aims to reconstruct the data in a lower-dimensional setting, and so it creates a small number of linear combinations of a vector x (assume it to be p-dimensional) to explain the variance within x. More specifically, we want to find vectors of weights w_i such that we can define the following linear combinations:

y_i = w_i^T x = \sum_{j=1}^{p} w_{ij} x_j

subject to the constraints that each w_i is a unit vector, that y_i is uncorrelated with y_j for i \neq j, and that var(y_i) is maximized.
Hence, we perform the following procedure, in which we first find y_1 = w_1^T x with maximal variance, meaning that the scores are obtained by orthogonally projecting the data onto the first principal direction w_1. We then find y_2 = w_2^T x that is uncorrelated with y_1 and has maximal variance, and we continue this procedure iteratively until ending with the kth dimension, such that y_1, ..., y_k explain the majority of the variance, with k << p.
To solve, note that we have the following for the variance of each y_i, utilizing the covariance matrix \Sigma of x:

var(y_i) = w_i^T var(x) w_i = w_i^T \Sigma w_i

Without any constraints, we could choose arbitrary weights to maximize this variance, and hence we normalize by assuming orthonormality of the w_i, which guarantees the following: w_i^T w_i = 1.
We now have a constrained maximization problem where we can use Lagrange multipliers. Specifically, we maximize the Lagrangian

w_i^T \Sigma w_i - \lambda_i (w_i^T w_i - 1)

which we differentiate with respect to w_i to solve the optimization problem:

\frac{d}{d w_i} \left[ w_i^T \Sigma w_i - \lambda_i (w_i^T w_i - 1) \right] = 2 \Sigma w_i - 2 \lambda_i w_i = 0

Simplifying, we see that:

\Sigma w_i = \lambda_i w_i

This is the result of an eigendecomposition, whereby w_i is an eigenvector of the covariance matrix and \lambda_i is its associated eigenvalue. Noting that we want to maximize the variance of each y_i, we have

var(y_i) = w_i^T \Sigma w_i = w_i^T \lambda_i w_i = \lambda_i w_i^T w_i = \lambda_i

which we want to be as large as possible. Hence, we choose the eigenvector with the largest eigenvalue as the first principal component, the eigenvector with the second-largest eigenvalue as the second principal component, and so on.
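A short numerical check that the principal directions are indeed eigenvectors of the sample covariance matrix:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0, 0], [[3, 1, 0], [1, 2, 0], [0, 0, 1]], size=2_000)

eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
pca = PCA(n_components=3).fit(X)

# The largest-eigenvalue eigenvector matches the first principal direction (up to sign).
v = eigvecs[:, np.argmax(eigvals)]
print(np.allclose(np.abs(v), np.abs(pca.components_[0]), atol=1e-2))
print(sorted(eigvals, reverse=True), pca.explained_variance_)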

Solution #7.33
Logistic regression aims to classify X into one of K classes by modeling the log-odds of each class against a reference class K:

\log \frac{P(C = i \mid X = x)}{P(C = K \mid X = x)} = \beta_{i0} + \beta_i^T x

Therefore, the model is equivalent to the following, where the denominator normalizes the numerator over the K classes:

P(C = k \mid X = x) = \frac{e^{\beta_{k0} + \beta_k^T x}}{\sum_{l=1}^{K} e^{\beta_{l0} + \beta_l^T x}}

The log-likelihood over N observations in general is the following:

L(\beta \mid X, C) = \sum_{i=1}^{N} \log P(C = c_i \mid X = x_i, \beta)

Use the following notation to denote classes 1 and 2 for the two-class case:

y_i = 1 if the class is 1, otherwise y_i = 0

Then we have P(C = 2 \mid X = x_i, \beta) = 1 - P(C = 1 \mid X = x_i, \beta). Writing p(x_i) = P(C = 1 \mid X = x_i, \beta), the log-likelihood can be written as follows:

L(\beta) = \sum_{i=1}^{n} \left[ y_i \log p(x_i) + (1 - y_i) \log(1 - p(x_i)) \right]

Simplifying yields the following:

L(\beta) = \sum_{i=1}^{n} \log(1 - p(x_i)) + \sum_{i=1}^{n} y_i \log \frac{p(x_i)}{1 - p(x_i)}

Substituting for the probabilities (so that \log \frac{p(x_i)}{1 - p(x_i)} = \beta_0 + \beta_1 x_i) yields the following:

L(\beta) = \sum_{i=1}^{n} -\log(1 + e^{\beta_0 + \beta_1 x_i}) + \sum_{i=1}^{n} y_i (\beta_0 + \beta_1 x_i)

To maximize this log-likelihood, take the derivative and set it equal to 0:

\frac{\partial L(\beta)}{\partial \beta} = \sum_{i=1}^{n} x_i (y_i - p(x_i)) = 0

To see this, note that:

\frac{\partial}{\partial \beta} \sum_{i=1}^{n} -\log(1 + e^{\beta_0 + \beta_1 x_i}) = \sum_{i=1}^{n} -\frac{e^{\beta_0 + \beta_1 x_i}}{1 + e^{\beta_0 + \beta_1 x_i}} x_i = -\sum_{i=1}^{n} p(x_i) x_i

which is the latter half of the above expression.
The solutions to these equations are not closed form, however, and, hence, the above should be iterated until convergence (for example, with Newton's method or gradient ascent).

Solution #7.34
Step 1: Clarify Details of Discover Weekly
First we can ask some clarifying questions:
 What is the goal of the algorithm?
 Do we recommend just songs, or do we also include podcasts?
 Is our goal to recommend new music to a user, and push their musical boundaries? Or is it to just
give them the music they'll want to listen to the most, so they spend more time on Spotify? Said
more generally, how do we think about the trade-off of exploration versus exploitation?
 What are the various service-level agreements to consider (e.g., does this playlist need to change
week to week if the user doesn't listen to it?)
 Do new users get a Discover Weekly playlist?
Step 2: Describe What Data Features You'd Use
The core features will be user-song interactions. This is because users' behaviors and reactions to
various songs should be the strongest signal for whether or not they enjoy a song. This approach is
similar to the well-known use case for movie recommendations on Netflix, with several notable
differences:
 Feedback for music does not have a 1-to-5 rating scale as Netflix does for its movies.
 Music may be subject to repeated consumption (i.e., people may watch a movie once or twice
but will listen to a song many times).
 Music has a wider variety (i.e., niche music).
 The scale of music catalog items is much larger than movies (i.e., there are many more songs
than movies).
There is also a variety of other features outside of user-song interactions that could be interesting to
consider. For example, we have plenty of metadata about the song (the artist, the album, the
playlists that include that song) that could be factored in. Additionally, potential audio features in the
songs themselves (tempo, speechiness, instruments used) could be used. And finally, demographic
information (age, gender, location, etc.) can also impact music listening preferences — people living
in the same region are more likely to have similar tastes than someone living on the other side of the
globe.

Step 3: Explain Collaborative Filtering Model Setup


There are two types of recommendation systems in general: collaborative filtering (recommending
songs that similar users prefer) and content-based recommendation (recommending similar types of
songs). Our answer will use collaborative filtering.


Collaborative-filtering uses data from feedback users have provided on certain items (in this case,
songs) in order to decide recommendations. Therefore, a user-song matrix (or a user-artist matrix)
would constitute the dataset at hand, with the rows of the dataset being users and the columns
various songs. However, as discussed in the prior section, since explicit song ratings are lacking, we
can proxy liking a song by using the number of times a user streamed it. This song play count is
stored for every entry in the user-song matrix.

The output of collaborative filtering is a latent user matrix and a song matrix. Using vectors from
these matrices, a dot product denotes the relevance of a particular song to a particular user. This
process can be used directly for recommendation at the user level, where we sort by relevance
scores on songs that the user has not yet streamed. You can also use these vectors to assess
similarity between different users and different songs using a method such as kNN (K-nearest
neighbors).

Step 4: Additional Considerations


Also relevant to this discussion are the potential pros and cons of collaborative filtering. For example,
one pro is that you can run it in a scalable manner to find correlations behind user-song interactions.
On the flip side, one con is the "cold start" problem, where an existing base of data is needed for any
given user.
Another important consideration is scale. Since Spotify has hundreds of millions of users, the
Discover Weekly algorithm could be updated in batch for various users at different times to help
speed up data processing and model training.
Another consideration is the dynamic nature of the problem; the influx of new users and songs,
along with fast-changing music trends, would necessitate constant retraining.
Lastly, it is important to consider how you can measure and track the impact of this system over time
— collaborative filtering doesn't come with clear metrics of performance. Ideally, you'd use an A/B
test to find that users with the improved recommendations had increased engagement on the
platform (as measured by time spent listening, for example).

Solution #7.35
We are attempting to solve for Var(\hat{\beta}).
Recall that the parameter estimates have the following closed-form solution in matrix form:

\hat{\beta} = (X^T X)^{-1} X^T y

To derive the variance of the estimates, recall that for any given random variable Z:

Var(Z) = E[Z^2] - (E[Z])^2

Applying the matrix analogue of this to \hat{\beta}, we have:

Var(\hat{\beta}) = E[\hat{\beta} \hat{\beta}^T] - E[\hat{\beta}] E[\hat{\beta}]^T

We can evaluate the second term since the parameter estimates are unbiased, so that E[\hat{\beta}] = \beta:

Var(\hat{\beta}) = E[\hat{\beta} \hat{\beta}^T] - \beta \beta^T

Since least squares assumes that y = X\beta + \epsilon, where \epsilon \sim N(0, \sigma^2 I), substituting into the closed-form solution yields:

\hat{\beta} = (X^T X)^{-1} X^T (X\beta + \epsilon) = \beta + (X^T X)^{-1} X^T \epsilon

since (X^T X)^{-1} X^T X = I. Therefore:

Var(\hat{\beta}) = E\left[ \left( \beta + (X^T X)^{-1} X^T \epsilon \right) \left( \beta + (X^T X)^{-1} X^T \epsilon \right)^T \right] - \beta \beta^T

The cross terms vanish since the expectation of the error term is 0, and the \beta \beta^T terms cancel, leaving:

Var(\hat{\beta}) = E\left[ (X^T X)^{-1} X^T \epsilon \epsilon^T X (X^T X)^{-1} \right] = (X^T X)^{-1} X^T E[\epsilon \epsilon^T] X (X^T X)^{-1} = \sigma^2 (X^T X)^{-1}



SQL & DB Design
CHAPTER 8

Upon hearing the term "data scientist," buzzwords such as predictive analytics, big data, and deep learning may leap to mind. So, let's not beat around the bush: data wrangling isn't the most fun or sexy part of being a data scientist. However, as a data scientist, you will likely spend a great deal of your time writing SQL queries to retrieve and analyze data. As such, almost every company you interview with will test your ability to write SQL
queries. These questions are practically guaranteed if you are interviewing for a data
scientist role on a product or analytics team, or if you're after a data science-adjacent role
like data analyst or business intelligence analyst. Sometimes, data science interviews may go
beyond just writing SQL queries, and cover the basic principles of database design and other
big data systems. This focus on data architecture is particularly true at early-stage startups,
where data scientists often take an active role in data engineering and data infrastructure
development.

SQL
How SQL Interview Questions Are Asked
Because most analytics workflows require quick slicing and dicing of data in SQL, interviewers will
often present you with hypothetical database tables and a business problem, and then ask you to
write SQL on the spot to get to an answer. This is an especially common early interview question,
conducted via a shared coding environment or through an automated remote assessment tool.
Because of the many different flavors of SQL used by industry, these questions aren't usually testing
your knowledge of database-specific syntax or obscure commands. Instead, interviews are designed
to test your ability to translate reporting requirements into SQL.

For example, at a company like Facebook, you might be given a table on user analytics and asked to
calculate the month-to-month retention. Here, it's relatively straightforward what the query should
be, and you're expected to write it. Some companies might make their SQL interview problems
more open-ended. For example, Amazon might give you tables about products and purchases and
then ask you to list the most popular products in each category. Robinhood may give you a table and
ask why users are churning. Here, the tricky part might not be just writing the SQL query, but also
figuring out collaboratively with the interviewer what "popular products" or "user churn" means in
the first place.
Finally, some companies might ask you about the performance of your SQL query. While these
interview questions are rare, and they don't expect you to be a query optimization expert, knowing
how to structure a database for performance, and avoid slow running queries, can be helpful. This
knowledge can come in handy as well when you are asked more conceptual questions about
database design and SQL.

Tips for Solving SQL Interview Questions


First off, don't jump into SQL questions without fully understanding the problem. Before you start
whiteboarding or typing out a solution, it's crucial to repeat back the problem so you can be sure
you've understood it correctly. Next, try to work backwards, especially if the answer needs multiple
joins, subqueries, and common table expressions (CTEs). Don't overwhelm yourself trying to figure
out the multiple parts of the final query at the same time. Instead, imagine you had all the
information you needed in a single table, so that your query was just a single SELECT statement.
Working backwards slowly from this ideal table, one SQL statement at a time, try to end up with the
tables you originally started with.
For more general problem-solving tips, be sure to also read the programming interview tips in the coding chapter. Most of what applies to solving coding questions — like showing your work and asking for help if stuck — applies to solving SQL interview questions too.

Basic SQL Commands


Before we cover the must-know SQL commands, a quick note — please don't be alarmed by minor
variations in syntax between your favorite query language and our PostgreSQL snippets:
 CREATE TABLE: Creates a table in a relational database and, depending on what database you use
(e.g., MySQL), can also be used to define the table's schema.
 INSERT: Inserts a row (or a set of rows) into a given table.
 UPDATE: Modifies already-existing data.
 DELETE: Removes a row (or a group of rows) from a database.
 SELECT: Selects certain columns from a table. A common part of most queries.
 GROUP BY: Groups/aggregates rows by the contents of a specific column or set of columns.
 WHERE: Provides a condition on which to filter before any grouping is applied.
 HAVING: Provides a condition on which to filter after any grouping is applied.

 ORDER BY: Sorts results in ascending or descending order according to the contents of a specific
column or set of columns.

 DISTINCT: Returns only distinct values.


 UNION: Combines results from multiple SELECT statements.

Joins
Imagine you worked at Reddit, and had two separate tables: users and posts.

Reddit users table:
 user_id (integer)
 country (string)
 active_status (boolean)
 join_time (datetime)

Reddit posts table:
 post_id (integer)
 user_id (integer)
 subreddit_id (integer)
 title (string)
 body (string)
 active_status (boolean)
 post_time (datetime)

Joins are used to combine rows from multiple tables based on a common column. As you can see, the user_id column is the common column between the two tables and links them; hence it is known as a join key. There are four main types of joins: INNER JOIN, OUTER JOIN, LEFT JOIN, and RIGHT JOIN.

INNER JOIN
Inner joins combine multiple tables and will preserve the rows where column values match in the
tables being combined. The word INNER is optional and is rarely used because it's the default type of
join. As an example, we use an inner join to find the number of Reddit users who have made a post:

SELECT
  COUNT(DISTINCT u.user_id)
FROM users u
JOIN posts p
  ON u.user_id = p.user_id;

A self join is a special case of an inner join where a table is joined with itself. The most common use case for a self join is to look at pairs of rows within the same table.

OUTER JOIN
Outer joins combine multiple tables by matching on the columns provided, while preserving all rows.
As an example of an outer join, we list all inactive users with posts and all inactive posts from any
user:
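One way such a query could look against the tables above (a sketch, since the book's exact snippet is not reproduced here):

SELECT
  u.user_id,
  p.post_id
FROM users u
FULL OUTER JOIN posts p
  ON u.user_id = p.user_id
WHERE u.active_status = FALSE
   OR p.active_status = FALSE;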

LEFT JOIN
Left joins combine multiple tables by matching on the column names provided, while preserving all
the rows from the first table of the join. As an example, we use a left join to find the percentage of
users that made a post:
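A sketch of what this query could look like (reconstructed, not the book's exact snippet):

SELECT
  100.0 * COUNT(DISTINCT p.user_id) / COUNT(DISTINCT u.user_id) AS pct_users_with_posts
FROM users u
LEFT JOIN posts p
  ON u.user_id = p.user_id;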

RIGHT JOIN
Right joins combine multiple tables by matching on the column names provided, while preserving all
the rows from the second table of the join. For example, we use the right join to find the percentage
of posts made where the user is located in the U.S.:
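A sketch of one possible version of this query (reconstructed):

SELECT
  100.0 * SUM(CASE WHEN u.country = 'US' THEN 1 ELSE 0 END) / COUNT(*) AS pct_us_posts
FROM users u
RIGHT JOIN posts p
  ON u.user_id = p.user_id;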


Join Performance
Joins are an expensive operation to process, and are often bottlenecks in query runtimes. As such, to
write efficient SQL, you want to be working with the fewest rows and columns before joining two
tables together. Some general tips to improve join performance include the following:
 Select specific fields instead of using SELECT *
 Use LIMIT in your queries
 Filter and aggregate data before joining
 Avoid multiple joins in a single query

Advanced SQL Commands


Aggregation
For interviews, you need to know how to use the most common aggregation functions like COUNT,
SUM, AVG, or MAX:
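For instance, a sketch over the posts table above (the particular aggregates chosen are illustrative):

SELECT
  subreddit_id,
  COUNT(*)          AS num_posts,
  AVG(LENGTH(body)) AS avg_post_length,
  MAX(post_time)    AS latest_post
FROM posts
GROUP BY subreddit_id;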

Filtering
SQL contains various ways to compare rows, the most common of which use = and <> (not equal), >,
and <, along with regex matching and other logical and filtering clauses such as OR and AND. For
example, below we filter to active Reddit users from outside the U.S.:
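One way to write that filter (a sketch; 'US' is an assumed country code):

SELECT
  user_id,
  country
FROM users
WHERE active_status = TRUE
  AND country <> 'US';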

Common Table Expressions and Subqueries


Common Table Expressions (CTEs) define a query and then allow it to be referenced later using an
alias. They provide a handy way of breaking up large queries into more manageable subsets of data.
For example, below is a CTE which gets the number of posts made by each user, which is then used
to get the distribution of posts made by users (i.e., 100 users posted 5 times, 80 users posted 6
times, and so on):
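A sketch of such a CTE, assuming the posts table above:

WITH posts_per_user AS (
  SELECT
    user_id,
    COUNT(*) AS num_posts
  FROM posts
  GROUP BY user_id
)
SELECT
  num_posts,
  COUNT(*) AS num_users   -- how many users made each number of posts
FROM posts_per_user
GROUP BY num_posts
ORDER BY num_posts;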


Subqueries serve a similar function to CTEs, but are inline in the query itself and must have a unique
alias for the given scope.
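For instance, the same post distribution could be written with an inline subquery (a sketch):

SELECT
  num_posts,
  COUNT(*) AS num_users
FROM (
  SELECT
    user_id,
    COUNT(*) AS num_posts
  FROM posts
  GROUP BY user_id
) AS posts_per_user   -- the inline subquery needs its own alias
GROUP BY num_posts
ORDER BY num_posts;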

CTEs and subqueries are mostly similar, with the exception that CTEs can be used recursively. Both
concepts are incredibly important to know and practice, since most of the harder SQL interview
questions essentially boil down to breaking the problem into smaller chunks of CTEs and
subqueries.

Window Functions
Window functions perform calculations across a set of rows, much like aggregation functions, but do
not group those rows as aggregation functions do. Therefore, rows retain their separate identities
even with aggregated columns. Thus, window functions are particularly convenient when we want to
use both aggregated and non-aggregated values at once. Additionally, the code is often easier to
manage than the alternative: using group by statements and then performing joins on the original
input table.
Syntax-wise, window functions require the OVER clause to specify a particular window. This window
has three components:


 Partition Specification: separates rows into different partitions, analogous to how GROUP BY
operates. This specification is denoted by the clause PARTITION BY
 Ordering Specification: determines the order in which rows are processed, given by the clause
ORDER BY
 Window Frame Size Specification: determines which sliding window of rows should be
processed for any given row. The window frame defaults to all rows within a partition but can be
specified by the clause ROWS BETWEEN (start, end)
For instance, below we use a window function to sum up the total Reddit posts per user, and then
add each post count to each row of the users table:
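A sketch of one way to express this (joining users to posts so each row keeps its identity, while the window function supplies the aggregate):

SELECT
  u.user_id,
  u.country,
  COUNT(p.post_id) OVER (PARTITION BY u.user_id) AS posts_per_user
FROM users u
LEFT JOIN posts p
  ON u.user_id = p.user_id;  -- each (user, post) row keeps its identity; posts_per_user repeats the aggregate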

Note that a comparable version without using window functions looks like the following:
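A sketch of that alternative, using GROUP BY in a subquery and joining the result back to users:

SELECT
  u.user_id,
  u.country,
  pc.posts_per_user
FROM users u
LEFT JOIN (
  SELECT
    user_id,
    COUNT(*) AS posts_per_user
  FROM posts
  GROUP BY user_id
) pc
  ON u.user_id = pc.user_id;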

As you can see, window functions tend to lead to simpler and more expressive SQL.

LAG and LEAD


Two popular window functions are LAG and LEAD. These are both positional window functions,
meaning they allow you to refer to rows before the current row (LAG) or rows after the current row
(LEAD). The example below uses LAG so that for every post, it finds the time difference between the
post at hand and the post made right before it in the same subreddit:
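A sketch, assuming the posts table above:

SELECT
  post_id,
  subreddit_id,
  post_time,
  post_time - LAG(post_time) OVER (
    PARTITION BY subreddit_id
    ORDER BY post_time
  ) AS time_since_prev_post   -- NULL for the first post in each subreddit
FROM posts;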


RANK
Say that for each user, we wanted to rank posts by their length. We can use the window function
RANK() to rank the posts by length for each user:
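A sketch using the posts table above (ranking longest posts first):

SELECT
  user_id,
  post_id,
  RANK() OVER (
    PARTITION BY user_id
    ORDER BY LENGTH(body) DESC
  ) AS post_length_rank
FROM posts;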

Databases and Systems


Although knowing all of the database's inner workings isn't strictly necessary, having a high-level
understanding of basic database and system design concepts is very helpful. Database interview
questions typically do not involve minutiae about specific databases but, instead, focus on how
databases generally operate and what trade-offs are made during schema design. For example, you
might be asked how you'd set up tables to represent a real-world situation, like storing data if you
worked at Reddit. You'd need to define the core tables (users, posts, subreddits) and then define the
relationships between them. For the Reddit example, posts would have a user_id column for the
corresponding user that made the post.
You also may be asked to choose which columns should be indexed, which allows for more rapid
lookup of data. For the Reddit example, you would want to index the user_id column, since it's likely
heavily used as a join key across many important queries.
While data science interviews don't go into system design concepts as deeply or as often as software
engineering and data engineering interviews do, these concepts can still show up from time to time. This is
particularly the case if you are joining a smaller company, where your data science job might involve
creating and managing data pipelines. Besides generic questions about scaling up data infrastructure, you might
be asked conceptual questions about popular large-scale processing frameworks (Hadoop, Spark) or
orchestration frameworks (Airflow, Luigi) — especially if you happen to list these technologies on
your resume.

Keys & Normalization


Primary keys ensure that each entity has its own unique identifier, i.e., no rows in a table are
duplicated with respect to their primary key. Foreign keys, on the other hand, establish mappings
between entities. By using a foreign key to link two related tables, we ensure that data is only stored
once in the database. For the Reddit example, in the posts table, the post_id column is the primary
key, and each post has a user_id which is a foreign key to the users table.

Item                                    Primary Key              Foreign Key
Consists of one or more columns         Yes                      Yes
Duplicate values allowed                No                       Yes
NULLs allowed                           No                       Yes
Uniquely identifies rows in a table     Yes                      Maybe
Number allowed per table                One                      Zero or more
Indexed                                 Automatically indexed    Not automatically indexed
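Concretely, the Reddit posts table's keys might be declared along these lines (a PostgreSQL sketch; the column types are assumptions based on the schema shown earlier):

CREATE TABLE posts (
  post_id       INTEGER PRIMARY KEY,                 -- primary key: uniquely identifies each post
  user_id       INTEGER REFERENCES users (user_id),  -- foreign key: links each post to its author
  subreddit_id  INTEGER,
  title         TEXT,
  body          TEXT,
  active_status BOOLEAN,
  post_time     TIMESTAMP
);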

Keys allow us to split data efficiently into separate tables, but still enforce a logical relationship
between two tables, rather than having everything duplicated into one table. This process of
generally separating out data to prevent redundancy is called normalization. Along with reducing
redundancy, normalization helps you enforce database constraints and dependencies, which
improves data integrity.
The disadvantage to normalization is that now we need an expensive join operation between the
two related tables. As such, in high-performance systems, denormalization is an optimization
technique where we keep redundant data to prevent expensive join operations. This speeds up read
times, but at the cost of having to duplicate data. At scale, this can be acceptable since storage is
cheap, but compute is expensive.
When normalization comes up in interviews, it often concerns the conceptual setup of database
tables: why a certain entity should have a foreign key to another entity, what the mapping
relationship is between two types of records (one-to-one, one-to-many, or many-to-many), and
when it might be advantageous to denormalize a database.

Properties of Distributed Databases


Two concepts, the CAP theorem and the ACID framework, are commonly used to assess theoretical
guarantees of databases and are discussed in detail below.
The CAP theorem provides a framework for assessing properties of a distributed database, although
only two of the theorem's three specifications can be met simultaneously. The name CAP is an
acronym based on the following desirable characteristics of distributed databases:
Consistency: All clients using the database see the same data.
Availability: The system is always available, and each request receives a non-error response, but
there's no guarantee that the response contains the latest data.
Partition tolerance: The system functions even if communication between nodes is lost or delayed.


Although the CAP theorem is a theoretical framework, one should consider the real-life trade-offs
that need to be made based on the needs of the business and those of the database's users. For
example, the Instagram feed focuses on availability and less so on consistency, since what matters is
that you get a result instantly when visiting the feed. The penalty for inconsistent results isn't high.
It's not going to crush users to see @ChampagnePapi's last post has 57,486 likes (instead of the
correct 57,598 likes). In contrast, when designing the service to handle payments on WhatsApp,
you'd favor consistency over availability, because you'd want all servers to have a consistent view of
how much money a user has, to prevent people from sending money they don't have. The downside
is that sometimes sending money takes a minute or a payment fails and you are asked to re-try. Both
are reasonable trade-offs in order to prevent double-spend issues.
The second principle for measuring the correctness and completeness of a database transaction is
called the ACID framework. ACID is an acronym derived from the following desirable characteristics:
 Atomicity: an entire transaction occurs as a whole or it does not occur at all (i.e., no partial
transactions are allowed). If a transaction aborts before completing, the database does a
"rollback" on all such incomplete transactions. This prevents partial updates to a database, which
cause data integrity issues.
 Consistency: integrity constraints ensure that the database is consistent before and after a given
transaction is completed. Appropriate checks handle any referential integrity for primary and
foreign keys.
 Isolation: transactions occur in isolation so that multiple transactions can occur independently
without interference. This characteristic properly maintains concurrency.
 Durability: once a transaction is completed, the database is properly updated with the data
associated with that transaction, so that even a system failure could not remove that data from
it.
The ACID properties are particularly important for online transactional processing (OLTP) systems,
where databases handle large volumes of transactions conducted by many users in real time.

Scaling Databases
Traditionally, database scaling was done by using full-copy clusters where multiple database servers
(each referred to as a node within the cluster) contained a full copy of the data, and a load balancer
would roundrobin incoming requests. Since each database server had a full copy of the data, each
node experienced the issues mentioned in the CAP theorem discussed above (especially during high-
load periods). With the advent of the cloud, the approach towards scaling databases has evolved
rapidly.
Nowadays, the cloud makes two main scaling strategies feasible: vertical and horizontal scaling.


Vertical scaling, also known as scaling up, involves adding CPU and RAM to existing machines. This
approach is easy to administer and does not require changing the way the system is architected.
However, vertical scaling can quickly become prohibitively expensive, eventually limiting the scope
for upgrades. This is because certain machines may be close to their physical limits, making it
practically impossible to replace them with more performant servers.
In horizontal scaling, also known as scaling out, more commodity machines (nodes) are added to the
resource pool. In comparison to vertical scaling, horizontal scaling has a much cheaper cost structure
and has better fault tolerance than vertical scaling. However, as expected, there are trade-offs with
this approach. With many more nodes, you have to deal with issues that arise in any distributed
system, like handling data consistency between nodes. Therefore, horizontal scaling offers a greater
set of challenges in infrastructure management compared to vertical scaling.
Sharding, in which database rows themselves are split across nodes in a cluster, is a common
example of horizontal scaling. For all tables, each node has the same schema and columns as the
original table, but the data are stored independently of other shards. To split the rows of data, a
sharding mechanism determines which node (shard) that data for a given key should exist on. This
sharding mechanism can be a hash function, a range, or a lookup table. The same operations apply
for reading data as well, and so, in this way, each row of data is uniquely mapped to one particular
shard.

Relational Databases vs. NoSQL Databases


Relational databases, like MySQL and Postgres, have a table-based structure with a fixed, pre-defined
schema. In contrast, NoSQL databases (named because they are "non-SQL" and "non-relational")
store data in a variety of forms rather than in a strict table-based structure.

SQL databases are relational; NoSQL databases come in several forms, including key-value, document, column, and graph stores.


One type of NoSQL database is the document database. MongoDB, the most popular document
database, associates each record with a document. The document allows for arbitrarily complex,
nested, and varied schemas inside it. This flexibility allows for new fields to be trivially added
compared to a relational database, which has to adhere to a pre-defined schema.


Another type of NoSQL database is the graph database. Neo4j is a well-known graph database,
which stores each data record along with direct pointers to all the other data records it is connected
to.
By making the relationships between the data as important as storing the data itself, graph
databases allow for a more natural representation of nodes and edges, when compared to relational
databases.

BASE Consistency Model


Analogous to the ACID consistency model for relational databases, the BASE model applies to NoSQL
databases:
 Basically Available: data is guaranteed to be available; there will be a response to any request.
This occurs due to the highly distributed approach of NoSQL databases. However, the requested
data may be inconsistent and inaccurate.
 Soft State: the system's state may change over time, even without input. These passive changes
occur due to the eventual consistency property.
 Eventual Consistency: data will eventually converge to a consistent state, although no
guarantees are made on when that will occur.
If you compare and contrast ACID and BASE, you will see that the BASE model puts a stronger focus
on availability and scalability but less of an emphasis on data correctness.

MapReduce
MapReduce is a popular data processing framework that allows for the concurrent processing of
large volumes of data. MapReduce involves four main steps:


1) Split step: splits up the input data and distributes it across different nodes
2) Map step: takes the input data and outputs <key, value> pairs
3) Shuffle step: moves all the <key, value> pairs with the same key to the same node
4) Reduce step: processes the <key, value> pairs and aggregates them into a final output

The secret sauce behind MapReduce’s efficiency is the shuffle step; by grouping related data onto
the same node, we can take advantage of the locality of data. Said another way, by shuffling the
related <key, value> pairs needed by the reduce step to the same node rather than sending them to
a different node for reducing, we minimize node-to-node communication, which is often the
bottleneck for distributed computing.
For a concrete example of how MapReduce works, assume you want to count the frequency of
words in a multi-petabyte corpus of text data. Here's how each MapReduce step operates in more
detail:
1. Split step: We split the large corpus of text into smaller chunks and distribute the pieces to
different machines.
2. Map step: Each worker node applies a specific mapping function to the input data and writes the
output <key, value> pairs to a memory buffer. In this case, our mapping function simply converts
each word into a tuple of the word and its frequency (which is always 1). For example, say we
had the phrase "hello world" on a single machine. The map step would convert that input into
two key-value pairs: <"hello",1> and <"world",1>. We do this for the entire corpus, so that if our
corpus is N words big, we end up with N key-value pairs in the map step.
3. Shuffle step: Data is redistributed based on the output keys from the prior step's map function,
such that tuples with the same key are located on the same worker node. In this case, it means
that all tuples of <”hello”,1> will be located on the same worker node, as will all tuples of
<”world”,1>, and so on.
4. Reduce step: Each worker node processes each key in parallel using a specified reducer
operation to obtain the required output result. In this case, we just sum up the tuple counts for
each key, so if there are 5 tuples for <”hello”,1> then the final output will be <"hello", 5>,
meaning that the word "hello" occurred 5 times.

Because the shuffle step moved all the "hello" key-value pairs to the same node, the reducer can
operate locally and, hence, efficiently. The reducer doesn't need to communicate with other nodes
to ask for their "hello" key-value pairs, which minimizes the amount of precious node-to-node
bandwidth consumed.
In practice, since MapReduce is just the processing technique, people rely on Hadoop to manage
the steps of the MapReduce algorithm. Hadoop involves:
1) Hadoop File System (HDFS): manages data storage, backup, and replication
2) MapReduce: as discussed above
3) YARN: a resource manager which manages job scheduling and worker node
orchestration
Spark is another popular open-source tool that provides batch processing similar to Hadoop, with a
focus on speed and reduced disk operations. Unlike Hadoop, it uses RAM for computations, enabling
faster in-memory performance but at higher running costs. Additionally, unlike MapReduce, Spark has
built-in resource scheduling and monitoring, whereas MapReduce relies on external resource managers
like YARN.


SQL & Database Design Questions


Easy Problems
8.1. Facebook: Assume you have the below events table on app analytics. Write a query to get the
click-through rate per app in 2019.
events
column_name      type
app_id           integer
event_id         string ("impression", "click")
timestamp        datetime

8.2. Robinhood: Assume you are given the tables below containing information on trades and users.
Write a query to list the top three cities that had the most number of completed orders.

8.3.

New York Times: Assume that you are given the table below containing information on viewership by
device type (where the three types are laptop, tablet, and phone). Define "mobile" as the sum
of tablet and phone viewership numbers. Write a query to compare the viewership on laptops
versus mobile devices.
viewership
column_name      type
user_id          integer
device_type      string
view_time        datetime

8.4. Amazon: Assume you are given the table below for spending activity by product type. Write a
query to calculate the cumulative spend so far by date for each product over time in
chronological order.


total_trans
column_name      type
order_id         integer
user_id          integer
product_id       string
spend            float
trans_date       datetime

8.5. eBay: Assume that you are given the table below containing information on various orders
made by customers. Write a query to obtain the names of the ten customers who have ordered
the highest number of products among those customers who have spent at least $ 1000 total.
user_transactions
Aa column_name type
Transaction_id integer
product id integer
user id integer
spend float
trans date datetime

8.6. Twitter: Assume you are given the table below containing information on tweets. Write a query
to obtain a histogram of tweets posted per user in 2020.
tweets
column_name      type
tweet_id         integer
user_id          integer
msg              string
tweet_date       datetime

8.7. Stitch Fix: Assume you are given the table below containing information on user purchases.
Write a query to obtain the number of people who purchased at least one or more of the same
product on multiple days.
purchases
Aa column name type
purchase id integer
user id integer
product id integer
quantity integer
price float
purchase_time datetime


8.8. Linkedin: Assume you are given the table below that shows the job postings for all companies
on the platform. Write a query to get the total number of companies that have posted
duplicate job listings (two jobs at the same company with the same title and description).
job_listings
Aa column name type
Job_id integer
Company_id integer
title string
description string
post_date datetime

8.9. Etsy: Assume you are given the table below on user transactions. Write a query to obtain the
list of customers whose first transaction was valued at $50 or more.
user_transactions
column_name       type
transaction_id    integer
product_id        integer
user_id           integer
spend             float
transaction_date  datetime

8.10. Twitter: Assume you are given the table below containing information on each user's tweets
over a period of time. Calculate the 7-day rolling average of tweets by each user for every date.
tweets
Aa column_name type
tweet id integer
msg string
user id integer
Tweet_date datetime

8.11. Uber: Assume you are given the table below on transactions made by users. Write a query to
obtain the third transaction of every user.
transactions
column_name       type
user_id           integer
spend             float
transaction_date  datetime

8.12. Amazon: Assume you are given the table below containing information on customer spend on
products belonging to various categories. Identify the top three highest-grossing items within
each category in 2020.


product_spend
column_name       type
transaction_id    integer
category_id       integer
product_id        integer
user_id           integer
spend             float
transaction_date  datetime

8.13. Walmart: Assume you are given the below table on transactions from users. Bucketing users
based on their latest transaction date, write a query to obtain the number of users who made a
purchase and the total number of products bought for each transaction date.
user_transactions
Aa column_name type
transaction id integer
product id integer
user id integer
spend float
transaction date datetime

8.14. Facebook: What is a database view? What are some advantages views have over tables?
8.15. Expedia: Say you have a database system where most of the queries made were UPDATEs/
INSERTs/DELETEs. How would this affect your decision to create indices? What if the queries
made were mostly SELECTs and JOINs instead?
8.16. Microsoft: What is a primary key? What characteristics does a good primary key have?
8.17. Amazon: Describe some advantages and disadvantages of relational databases vs. NoSQL
databases.
8.18. Capital One: Say you want to set up a MapReduce job to implement a shuffle operator, whose
input is a dataset and whose output is a randomly ordered version of that same dataset. At a
high level, describe the steps in the shuffle operator's algorithm.
8.19. Amazon: Name one major similarity and difference between a WHERE clause and a HAVING
clause in SQL.
8.20. KPMG: Describe what a foreign key is and how it relates to a primary key.
8.21. Microsoft: Describe what a clustered index and a non-clustered index are. Compare and
contrast the two.

Medium Problems
8.22. Twitter: Assume you are given the two tables below containing information on the topics that
each Twitter user follows and the ranks of each of these topics. Write a query to obtain all
existing users on 2021-01-01 that did not follow any topic in the 100 most popular topics for
that day.


user_topics
column_name      type
user_id          integer
topic_id         integer
follow_date      datetime

topic_rankings
column_name      type
topic_id         integer
ranking          integer
ranking_date     datetime

8.23. Facebook: Assume you have the tables below containing information on user actions. Write a
query to obtain active user retention by month. An active user is defined as someone who took
an action (sign-in, like, or comment) in the current month.
user_actions
column_name      type
user_id          integer
event_id         string ("sign-in", "like", "comment")
timestamp        datetime

8.24. Twitter: Assume you are given the tables below containing information on user session activity.
Write a query that ranks users according to their total session durations for each session type
between the start date (2021-01-01 ) and the end date (2021-02-01).
sessions
Aa column name type
Session_id integer
User_id integer
session_type string
duration integer
start time datetime

8.25. Snapchat: Assume you are given the tables below containing information on Snapchat users
and their time spent sending and opening snaps. Write a query to obtain a breakdown of the
time spent sending vs. opening snaps (as a percentage of total time spent) for each of the
different age groups.
activities
column_name      type
activity_id      integer
user_id          integer
type             string ('send', 'open')
time_spent       float
activity_date    datetime

age_breakdown
column_name      type
user_id          integer
age_bucket       string

8.26. Pinterest: Assume you are given the table below containing information on user sessions,
including their start and end times. A session is considered to be concurrent with another user's
session if they overlap. Write a query to obtain the user session that is concurrent with the
largest number of other user sessions.
sessions
Aa column_name type
session id integer
start time datetime
end time datetime

8.27. Yelp: Assume you are given the table below containing information on user reviews. Define a
top-rated business as one whose reviews contain only 4 or 5 stars. Write a query to obtain the
number and percentage of businesses that are top rated.
reviews
Aa column_name type
business id integer
user id integer
review_text string
Review_stars integer
Review_date datetime

8.28. Google: Assume you are given the table below containing measurement values obtained from
a sensor over several days. Measurements are taken several times within a given day. Write a
query to obtain the sum of the odd-numbered measurements and the sum of the even-
numbered measurements by date.
measurements
Aa column name type
measurement id integer
measurement value float
Measurement_time datetime

8.29. Etsy: Assume you are given the two tables below containing information on user signups and
user purchases. Of the users who joined within the past week, write a query to obtain the
percentage of users that also purchased at least one item.



signups
column_name      type
user_id          integer
signup_date      datetime

user_purchases
column_name      type
user_id          integer
product_id       integer
purchase_amount  float
purchase_date    datetime

8.30. Walmart: Assume you are given the following tables on customer transactions and products
Find the top 10 products that are most frequently bought together (purchased in the same
transaction).
transactions
column_name       type
transaction_id    integer
product_id        integer
user_id           integer
quantity          integer
transaction_time  datetime

products
column_name       type
product_id        integer
product_name      string
price             float

8.31. Facebook: Assume you have the table given below containing information on user logins. Write
a query to obtain the number of reactivated users (i.e., those who didn't log in the previous
month, who then logged in during the current month).
user_logins
column_name      type
user_id          integer
login_date       datetime

8.32. Wayfair: Assume you are given the table below containing information on user transactions for
particular products. Write a query to obtain the year-on-year growth rate for the total spend of
each product, for each week (assume there is data each week).
user_transactions
Aa column_name type
transaction id integer
product id integer
User_id integer
spend float
transaction date datetime



8.33. Stripe: Assume you are given the table below containing information on user transactions for a
particular business using Stripe. Write a query to obtain the account's rolling 7-day earnings.
user_transactions
Aa column_name type
transaction id integer
User_id integer
amount float
Transaction_date datetime

8.34. Facebook: Say you had the entire Facebook social graph (users and their friendships). How
would you use MapReduce to find the number of mutual friends for every pair of Facebook
users?
8.35. Google: Assume you are tasked with designing a large-scale system that tracks a-variety of
search query strings and their frequencies. How would you design this, and what trade-offs
would you need to consider?

SQL & Database Design Solutions


Note: Due to the variety of SQL flavors, don't be alarmed by minor variations in syntax. We've
written the SQL snippets in this book in PostgreSQL.

Solution #8.1
To get the click-through rate, we use the following query, which includes a SUM along with an IF to
obtain the total number of clicks and impressions, respectively. Lastly, we filter the timestamp to
obtain the click-through rate for just the year 2019.
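A sketch of such a query (PostgreSQL has no IF in plain queries, so CASE plays that role; column names follow the events table above):

SELECT
  app_id,
  100.0 * SUM(CASE WHEN event_id = 'click' THEN 1 ELSE 0 END)
        / NULLIF(SUM(CASE WHEN event_id = 'impression' THEN 1 ELSE 0 END), 0) AS click_through_rate
FROM events
WHERE timestamp >= '2019-01-01'
  AND timestamp < '2020-01-01'
GROUP BY app_id;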

Solution #8.2
"To find the cities with the top three highest number of completed orders, we first write an inner
query to join the trades and user table based on the user_id column and then filter for complete
orders. Using COUNT DISTINCT, we obtain the number of orders per city. With that result, we then
perform a simple GROUP BY on city and order by the resulting number of orders, as shown below:


Solution #8.3
To compare the viewership on laptops versus mobile devices, we first can use an IF statement to
define the device type according to the specifications. Since the tablet and phone categories form
the "mobile" device type, we can set laptop to be its own device type (i.e., "laptop"). We can then
simply SUM the counts for each device type:
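A sketch of that approach (again using CASE in place of IF, with the viewership table above):

SELECT
  SUM(CASE WHEN device_type = 'laptop' THEN 1 ELSE 0 END)             AS laptop_views,
  SUM(CASE WHEN device_type IN ('tablet', 'phone') THEN 1 ELSE 0 END) AS mobile_views
FROM viewership;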

Solution #8.4
Since we don't care about the particular order_id or user_id, we can use a window function to
partition by product and order by transaction date. Spending is then summed over every date and
product as follows:
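One way this could look (a sketch over the total_trans table above; rows repeat per transaction, with the running total computed per product in date order):

SELECT
  trans_date,
  product_id,
  SUM(spend) OVER (
    PARTITION BY product_id
    ORDER BY trans_date
  ) AS cumulative_spend
FROM total_trans
ORDER BY product_id, trans_date;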


Solution #8.5
In order to obtain a count of products by user, we employ COUNT on product_id for each user; hence,
the GROUP BY is performed over user_id. To filter on having spent at least $1,000, we use a HAVING
SUM(spend) > 1000 clause. Lastly, we order user_ids by product_id count and take the top 10.

Solution #8.6
First, we obtain the number of tweets per user in 2020 by using a simple COUNT within an initial
subquery. Then, we use that tweet column as the bucket within a new GROUP BY and COUNT as
shown below:
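A sketch of that approach (written here with a CTE; column names follow the tweets table above):

WITH tweets_per_user AS (
  SELECT
    user_id,
    COUNT(*) AS tweet_count
  FROM tweets
  WHERE tweet_date >= '2020-01-01'
    AND tweet_date < '2021-01-01'
  GROUP BY user_id
)
SELECT
  tweet_count,
  COUNT(*) AS num_users   -- histogram bucket: how many users posted this many tweets
FROM tweets_per_user
GROUP BY tweet_count
ORDER BY tweet_count;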


Solution #8.7
We can't simply perform a count since, by definition, the purchases must have been made on
different days (and for the same products). To address this issue, we use the window function RANK
while partitioning by user_id and product_id and then order the result by purchase time in order to
determine the purchase number. From this inner subquery, we then obtain the count of user ids for
which purchase number was 2 (note that we don't need above 2 since any purchase number above 2
denotes multiple products).


Solution #8.8
To find all companies with duplicate listings based on title and description, we can use a RANK()
window function partitioning on company_id, job_title, and job_description. Then, we can filter for
companies where the largest row number based on those partition fields is greater than 1, which
indicates duplicated jobs, and then take a count of the number of companies:


Solution #8.9
Although we could use a self join on transaction_date = MIN (transaction_date) for each user, we
could also use the ROW NUMBER window function to get the ordering of customer purchases. We
could then use that subquery to filter on customers whose first purchase (shown in row one) was
valued at 50 dollars or more. Note that this would require the subquery to include spend also:
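A sketch of the ROW_NUMBER approach (column names follow the user_transactions table above):

WITH ordered_transactions AS (
  SELECT
    user_id,
    spend,
    ROW_NUMBER() OVER (
      PARTITION BY user_id
      ORDER BY transaction_date
    ) AS purchase_number
  FROM user_transactions
)
SELECT user_id
FROM ordered_transactions
WHERE purchase_number = 1   -- each user's first transaction
  AND spend >= 50;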

Solution #8.10
First, we need to obtain the total number of tweets made by each user on each day, which can be
gotten in a CTE using GROUP BY with user_id and tweet_date, while also applying a COUNT DISTINCT
to tweet_id. Then, we use a window function on the resulting subquery to take an AVG number of
tweets over the six prior rows and the current row (thus giving us the 7-day rolling average), while
ordering by user_id and tweet_date:


Solution #8.11
First, we obtain the transaction numbers for each user. We can do this by using the ROW NUMBER
window function, where we PARTITION by the user_id and ORDER by the transaction_date fields,
calling the resulting field a transaction number. From there, we can simply take all transactions
having a transaction number equal to 3.
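A sketch of that query, using the transactions table above:

WITH numbered_transactions AS (
  SELECT
    user_id,
    spend,
    transaction_date,
    ROW_NUMBER() OVER (
      PARTITION BY user_id
      ORDER BY transaction_date
    ) AS transaction_number
  FROM transactions
)
SELECT
  user_id,
  spend,
  transaction_date
FROM numbered_transactions
WHERE transaction_number = 3;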


Solution #8.12
First, we calculate a subquery with total spend by product and category using SUM and GROUP BY.
Note that we must filter by a 2020 transaction date. Then, using this subquery, we utilize a window
function to calculate the rankings (by spend) for each product category using the RANK window
function over the existing sums in the previous subquery. For the window function, we PARTITION by
category and ORDER by product spend. Finally, we use this result and then filter for a rank less than
or equal to 3 as shown below.


Solution #8.13
First, we obtain the latest transaction date for each user. This can be done in a CTE using the RANK
window function to get rankings of products purchased per user based on the purchase transaction
date. Then, using this CTE, we simply COUNT both the user ids and product ids where the latest rank
is 1 while grouping by each transaction date.


Solution #8.14
A database view is the result of a particular query within a set of tables. Unlike a normal table, a
view does not have a physical schema. Instead, a view is computed dynamically whenever it is
requested. If the underlying tables that the views reference are changed, then the views will change
accordingly. Views have several advantages over tables:
1. Views can simplify workflows by aggregating multiple tables, thus abstracting the complexity of
underlying data or operations.
2. Since views can represent only a subset of the data, they provide limited exposure of the table's
underlying data and hence increase data security.
3. Since views do not store actual data, there is significantly less memory overhead.

Solution #8.15
SQL statements that modify the database, like UPDATE, INSERT, and DELETE, need to change not only
the rows of the table but also the underlying indexes. Therefore, the performance of those
statements depends on the number of indexes that need to be updated. The larger the number of
indexes, the longer it takes those statements to execute. On the flip side, indexing can dramatically
speed up row retrieval since no underlying indexes need to be modified. This is important for
statements performing full table scans, like SELECTs and JOINs.
Therefore, for databases used in online transaction processing (OLTP) workloads, where database
updates and inserts are common, indexes generally lead to slower performance. In situations where
databases are used for online analytical processing (OLAP), where database modifications are
infrequent but searching and joining the data is common, indexes generally lead to faster
performance.

Solution #8.16
A primary key uniquely identifies an entity. It can consist of multiple columns (known as a composite
key) and cannot be NULL.
Characteristics of a good primary key are:
 Stability: a primary key should not change over time.
 Uniqueness: having duplicate (non-unique) values for the primary key defeats the purpose of the
primary key.
 Irreducibility: no subset of columns in a primary key is itself a primary key. Said another way,
removing any column from a good primary key means that the key's uniqueness property would
be violated.

Solution #8.17
Advantages of Relational Databases: Ensure data integrity through a defined schema and the ACID
properties. Easy to get started with and use for small-scale applications. Lends itself well to vertical
scaling. Uses an almost standard query language, making learning or switching between different
types of relational databases easy.
Advantages of NoSQL Databases: Offers more flexibility in data format and representations, which
makes working with unstructured or semistructured data easier. Hence, useful when still iterating on
the data schema or adding new features/functionality rapidly like in a startup environment.
Convenient to scale with horizontal scaling. Lends itself better to applications that need to be highly
available.


Disadvantages of Relational Databases: Data schema needs to be known in advance. Altering
schemas is possible, but frequent changes to the schema for large tables can cause performance
issues. Horizontal scaling is relatively difficult, leading to eventual performance bottlenecks.
Disadvantages of NoSQL Databases: As outlined by the BASE framework, weaker guarantees on data
correctness are made due to the soft-state and eventual consistency property. Managing data
consistency can also be difficult due to the lack of a predefined schema that's strictly adhered to.
Depending on the type of NoSQL database, it can be challenging for the database to handle some
types of complex queries or access patterns.

Solution #8.18
At a high level, to shuffle the data randomly, we need to map each row of the input data to a random
key. This ensures that the row of input data is randomly sent to a reducer, where it's simply
outputted. More concretely, the steps of the MapReduce algorithm are:
1. Map step: Each row is assigned a random value from 1,...,k, where k is the number of reducer
nodes available. Therefore, for every key, the output is the tuple (key, row).
2. Shuffle step: Rows with the same input key go to the same reducer.
3. Reduce step: For each record, the row is simply outputted.
Since the reducer only has rows that were filtered randomly for a given value of i, where i is from
1,...,k, the resulting output will be ordered randomly.

Solution #8.19
A couple of answers are possible, but here are some examples:
Similarities:
1. Both clauses are used to limit/filter a given query's results.
2. Both clauses are optional within a query.
3. Usually, queries utilizing one of the two can be transformed to use the other.
Differences:
1. A HAVING clause can follow a GROUP BY statement, but WHERE cannot.
2. A WHERE clause evaluates per row, whereas a HAVING clause evaluates per group.
3. Aggregate functions can be referred to in a logical expression if a HAVING clause is used.

Solution #8.20
Foreign keys are a set of attributes that aid in joining tables by referencing primary keys (although
joins can occur without them). Primarily, they exist to ensure data integrity. The table with the
primary key is called the parent table, whereas the table with the foreign key is called the child table.
Since foreign keys create a link between the two tables, having foreign keys ensures that these links
are valid and prevents data from being inserted that would otherwise violate these conditions.
Foreign keys can be created during CREATE commands, and it is possible to DROP or ALTER foreign
keys.
When designating foreign keys, it is important to think about the cardinality — the relationship
between parent and child tables. Cardinality can take on four forms: one-to-one (one row in the
parent table maps to one row in the child table), one-to-many (one row in the parent table maps to
many rows in the child table), many-to-one (many rows in the parent table map to one row in the
child table), and many-to-many (many rows in the parent table map to many rows in the child table).
The particular type of relationship between the parent and child table determines the specific syntax
used when setting up foreign keys.

Solution #8.21
Both clustered indexes and non-clustered indexes help speed up queries in a database. With a
clustered index, database rows are stored physically on the disk in the same exact order as the index.
This arrangement allows you to rapidly retrieve all rows that fall into a range of clustered index
values. However, there can only be one clustered index per table since data can only be sorted
physically on the disk in one particular way at a time.
In contrast, a non-clustered index does not match the physical layout of the rows on the disk on
which the data are stored. Instead, it duplicates data from the indexed column(s) and contains a
pointer to the rest of the data. A non-clustered index is stored separately from the table data, and
hence, unlike a clustered index, multiple non-clustered indexes can exist per table. Therefore, insert
and update operations on a non-clustered index are faster since data on the disk doesn't need to
match the physical layout as in the case of a clustered index. However, this makes the storage
requirement for a non-clustered index higher than for a clustered index. Additionally, lookup
operations for a non-clustered index may be slower than those of a clustered index since all queries
must go through an additional layer of indirection.

Solution #8.22
First, we need to obtain the top 100 most popular topics for the given date by employing a simple
subquery. Then, we need to identify all users who followed no topic included within these top 100
for the date specified. Equivalently, we could identify those that did follow one of these topics and
then filter them out of this list of users that existed on 2021-01-01.
Two approaches are as follows:
1. use the MINUS (or EXCEPT) operator and subtract those following a top 100 topic (via an inner
join) from the entire user universe
2. use a WHERE NOT EXISTS clause in a similar fashion.
For simplicity, the solution below uses the MINUS operator. Note that we need to filter for date in
the user_topics table so that we capture only existing users as of 2021-01-01:


Solution #8.23
In order to calculate user retention, we need to check for each user whether they were active this
month versus last month. To bucket days into each month, we need to obtain the first day of the
month for the specified date by using DATE TRUNC. We use a COUNT DISTINCT over user id to obtain
the monthly active user(MAU) count for the month. This can be put into a subquery called
curr_month, and then EXISTS can be used to check it against another subquery for the previous
months last_monlh. In that subquery, ADD MONTHS can be used with an argument of 1 to get the
previous month. thereby allowing us to Check for user actions from previous month (since that
would mean they were logged in), as shown below:


Solution #8.24
First, we can perform a CTE to obtain the total session duration by user and session type between
the start and end dates. Then, we can use RANK to obtain the rank, making sure to partition by
session type and then order by duration as in the query below:


Solution #8.25
We can obtain the total time spent on sending and opening using conditional IF statements for each
activity type while getting the amount of time_spent in a CTE. We can also obtain the total time
spent in the same CTE. Next, we take that result and JOIN by the corresponding user_id with
activities. We filter for just send and open activity types and group by age bucket. Then, using this
CTE, we can calculate the percentages of send and open time spent versus overall time spent as
follows:


Solution #8.26
The first step is to determine the query logic for when two sessions are concurrent. Say we have two
sessions, session 1 and session 2. Note that there are two cases in which they overlap:
1. If session 1 starts first, then the start time for session 2 is less than or equal to session 1's end
time
2. If session 2 starts first, then session 2's end time is greater than or equal to session 1's start
time
In total, this simplifies to session 2's start time falling between session 1's start time and session 1's
end time.
With this in mind, we can calculate the number of sessions that started during the time another
session was running by using an inner join and using BETWEEN to check the concurrency case as
follows:


Solution #8.27
First, we need to identify businesses having reviews consisting of only 4 or 5 stars. We can do so by
using a CTE to find the lowest number of stars given to a business across all its reviews. Then, we can
use a SUM and IF statement to filter across businesses with a minimum review of 4 or 5 stars to get

the total number of top-rated businesses, and then divide this by the total number of businesses to
find the percentage of top-rated businesses.

Solution #8.28
First, we need to establish which measurements are odd numbered and which are even numbered.
We can do so by using the ROW NUMBER window function over the measurement_time to obtain
the measurement number during a day. Then, we filter for odd numbers by checking if a
measurement’s mod 2 is 1 for odds or is 0 for evens. Finally, we sum by date using a conditional IF
statement, summing over the corresponding measurement_value:


Solution #8.29
First, we obtain the latest week's users. To do this, we use NOW for the current time and subtract an
INTERVAL of 7 days, thus providing the relevant user IDs to look at. By using LEFT JOIN, we have all
signed-in users, and whether they made a purchase or not. Now we take the COUNT of DISTINCT
users from the purchase table, divide it by the COUNT of DISTINCT users from the signup table, and
then multiply the results by 100 to obtain a percentage:

Solution #8.30
First, we can join the transactions and product tables together based on product_id to get the user_id,
product_name, and transaction time for the transactions. With the CTE at hand, we can do a self join
to fetch products that were purchased together by a single user by joining on transaction_id. Note
that we want all pairs of products, but we don't want to overcount, i.e., if user A purchased products
X and Y in the same transaction, then we only want to count the (X, Y) transaction once, and not also
(Y, X). To handle this, we can use a condition within the inner join that the product id of A is less than
that of B (where A and B are the CTE results from before). Lastly, we use a GROUP BY clause for each
pair of products and sort by the resulting count, taking the top 10:


Solution #8.31
First, we look at all users who did not log in during the previous month. To obtain the last month's
data, we subtract an INTERVAL of 1 month from the current month's login date. Then, we use a
WHERE EXISTS against the previous month's interval to check whether there was a login in the
previous month. Finally, we COUNT the number of users satisfying this condition.


Solution #8.32
First, we need to obtain the total weekly spend by product using SUM and GROUP BY operations and
use DATE TRUNC on the transaction date to specify a particular week. Using this information, we

then calculate the prior year's weekly spend for each product. In particular, we want to take a LAG
for 52 weeks, and PARTITION BY product, to calculate that week's prior year spend for the given
product. Lastly, we divide the current total spend by the corresponding previous 52-week lag value:

Solution #8.33
First, we need to obtain the total daily transactions using a simple SUM and GROUP BY operation.
Having the daily transactions, we then perform a self join on the table using the condition that the
transaction date for one transaction occurs within 7 days of the other, which we can check by using
the DATE_ADD function along with the condition that the earlier date doesn't precede the later date:


Solution #8.34
To use MapReduce to find the number of mutual friends for all pairs of Facebook users, we can think
about what the end output needs to be and then work backward. Concretely, for all given pairs of
users X and Y, we want to identify which friends they have in common, from which we'll derive the
mutual friend count. The core of this algorithm is finding the intersection between the friends list for
X and the friends list for Y. This operation can be delegated to the reducer. Therefore, it is sensible
that the key for our reduce step should be the tuple (X, Y) and that the value to be reduced is the
combination of the friends list of X and the friends list of Y. Thus, in the map step, we want to output
the tuple (X, Z) for each friend Z that X has.
As an example, assume that X is friends with [W, Y, Z] and Y is friends with [X, Z].
1. Map step: For X, we want to output the following tuples: 1) ((X, W), [W, Y, Z]), 2) ((X, Y), [W, Y, Z]),
and 3) ((X, Z), [W, Y, Z]). For Y we want to output the following tuples: l) ((X, Y), [X, Z]), and 2) ((Y,
Z), [X, Z]). Note that the key is sorted, so that (Y, X) becomes (X, Y).
2. Shuffle step: Each machine is delegated data based on the keys from the map step, i.e., each
tuple (X, Y). So, in the previous example, note that the map step outputs the key (X, Y) for both X
and Y, and therefore both of the keys are on the same machine. That machine will therefore have
the tuple (X, Y) as the key, and will store [W, Y, Z] and [X, Z] to be used in the reduce step.
3. Reduce step: We group by keys and take the intersection of the resulting lists. For the example of
(X, Y) mapping to [W, Y, Z] and [X, Z], we take the intersection of [W, Y, Z] and [X, Z], which is [Z]. Thus, we
return the length of the set (1) for the input (X, Y).
Therefore, we are able to identify Z as the common friend of X and Y, and can return 1 as the number
of mutual friends. The process outlined above is repeated in parallel for every pair of Facebook users
in order to find the final mutual friend counts between each pair of users.

Solution #8.35
To design a system that tracks search query strings and their frequencies, we can start with a basic
key-value store. For each search query string, we store the corresponding frequency in a database
table containing only those two fields. To build the system at scale, we have two options: vertical
scaling or horizontal scaling. For vertical scaling, we would add more CPU and RAM to existing
machines, an action that is not likely to work well at Google's scale. Instead, we should consider
horizontal scaling, in which more machines (nodes) are added to a cluster. We would then store
search query strings across a large set of nodes and be able to quickly find which node contains a
given search query string.
For the actual sharding logic, consisting of mapping query strings to particular shards, several
approaches are possible. One way is to use a range of values; for example, we could have 26 shards
and map query strings beginning with A to shard 1, B to shard 2, and so on. While this approach is
simple to implement, its primary drawback is that the data obtained could be unevenly distributed,
meaning that certain shards would need to deal with much more data than others. For example, the
shard containing strings starting with the letter 'x' will have much less load than the shard containing
strings starting with the letter 'a.'
An alternative sharding scheme could be to use a hash function that maps the query string to a
particular shard number. This is another simple solution and would help reduce the problem of all
data being mapped to the same shard. However, adding new nodes is troublesome since the hash
function must be re-run across all nodes and the data rebalanced. Fortunately, those problems can be
addressed through a method called "consistent hashing," which aids in data rebalancing when new
servers are added.



Coding
CHAPTER 9

Every Superman has his kryptonite, but as a Data Scientist, coding can't be yours. Between
data munging, pulling in data from APIs, and setting up data processing pipelines, writing
code is a near-universal part of a Data Scientist's job. This is especially true at smaller
companies, where data scientists tend to wear multiple hats and are responsible for
productionizing their analyses and models. Even if you are the rare Data Scientist that never
has to write production code, consider the collaborative nature of the field— having strong
computer science fundamentals will give you a leg up when working with software and data
engineers.
To test your programming foundation, Data Science interviews often take you on a stroll
down memory lane back to your Data Structures and Algorithms class (you did take one,
right?). These coding questions test your ability to manipulate data structures like lists, trees,
and graphs, along with your ability to implement algorithmic concepts such as recursion and
dynamic programming. You're also expected to assess your solution's runtime and space
efficiency using Big O notation.

Approaching Coding Questions


Coding interviews typically last 30 to 45 minutes and come in a variety of formats. Early in the
interview process, coding interviews are often conducted via remote coding assessment tools like
HackerRank, Codility, or CoderPad. During final-round onsite interviews, it's typical to write code on
a whiteboard. Regardless of the format, the approach outlined below to solve coding interview
problems applies.


After receiving the problem: Don't jump right into coding. It's crucial first to make sure you are
solving the correct problem. Due to language barriers, misplaced assumptions, and subtle nuances
that are easy to miss, misunderstanding the problem is a frequent occurrence. To prevent this, make
sure to repeat the question back to the interviewer so that the two of you are on the same page.
Clarify any assumptions made, like the input format and range, and be sure to ask if the input can be
assumed to be non-null or well formed. As a final test to see if you've understood the problem, work
through an example input and see if you get the expected output. Only after you've done these steps
are you ready to begin solving the problem.
When brainstorming a solution: First, explain at a high level how you could tackle the question. This
usually means discussing the brute-force solution. Then, try to gain an intuition for why this brute-
force solution might be inefficient, and how you could improve upon it. If you're able to land on a
more optimal approach, articulate how and why this new solution is better than the first brute-force
solution provided. Only after you've settled on a solution is it time to begin coding.
When coding the solution: Explain what you are coding. Don't just sit there typing away, leaving your
interviewer in the dark. Because coding interviews often let you pick the language you write code in,
you're expected to be proficient in the programming language you chose. As such, avoid pseudocode
in favor of proper, compilable code. While there is time pressure, don't take too many shortcuts when
coding. Use clear variable names and follow good code organization principles. Write well-styled
code, for example by following PEP 8 guidelines when coding in Python. While you are allowed to cut
some corners, like assuming a helper method exists, be explicit about it and offer to fix this later on.
After you're done coding: Make sure there are no mistakes or edge cases you didn't handle. Then
write and execute test cases to prove you solved the problem.
At this point, the interviewer should dictate which direction the interview heads. They may ask
about the time and space complexity of your code. Sometimes they may ask you to refactor and
clean up the code, especially if you cut some corners while coding the solution. They may also extend
the problem, often with a new constraint. For example, if you used recursion, they may ask you to
solve the problem iteratively instead. Or, they might ask you to avoid using extra memory and
instead solve the problem in place. Sometimes, they may pose a tougher variant of the problem
as a follow-up, which might require starting the problem-solving process all over again.

Space & Time Complexity


Determining the runtime and space usage (how much memory is utilized) of an algorithm is essential
for coding interviews and real-world data science. Because compute and storage resources can be
bottlenecks to machine learning model training and deployment, analyzing an algorithm's
performance can affect what techniques you choose to implement. Consider OpenAI's GPT-3
language model, which contains over 175 billion parameters and took $12 million in compute
resources to train. Much of the work bringing GPT-3 to the world involved optimizing resource usage
to efficiently train such a large model.
Computer scientists analyze and classify the behavior of an algorithm's time and space usage via
asymptotic complexity analysis. This technique considers how an algorithm performs when the input
size goes toward infinity and characterizes the behavior of the runtime and space used as a function
of n. In academic settings, we establish tight bounds on performance in terms of n using Big Θ (big
theta) notation. However, in industry, the technical definitions have been muddled, and we tend to
denote these tight bounds on performance using Big O notation.


In the context of companies asking interview questions, we care not just about establishing tight
bounds on performance but also about the worst-case scenario for that performance. As such,
Big O notation often describes the "worst-case upper bound," or the longest an algorithm would run
or the maximal amount of space it would need in the worst case.
For instance, consider an array of size N. Here are the common classes of runtime complexities, from
fastest to slowest, using Big O notation:
• O(1): Constant time. Example: getting a value at a particular index from an array
• O(log N): Logarithmic time. Example: binary search on a sorted array
• O(N): Linear time. Example: using a for-loop to traverse through an array
• O(N log N): Log-linear time. Example: running mergesort on an array
• O(N^2): Quadratic time. Example: iterating over every pair of elements in an array using a double for-loop
• O(2^N): Exponential time. Example: recursively generating all binary numbers that are N digits long
• O(N!): Factorial time. Example: generating all permutations of an array

The same Big-O runtime analysis concepts apply analogously to space complexity. For example, if we
need to store a copy of an input array with N elements, that would be an additional O(N) space. If we
wanted to store an adjacency matrix among N nodes, we would need O(N^2) space to keep the N-
by-N sized matrix.
For a basic example of both runtime and space complexity analysis, we can look at binary search,
where we are searching for a particular value within a sorted array. The code that implements this
algorithm is below (with an extra condition that returns the closest value if the exact value is not
found):
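A minimal sketch of this idea, assuming a non-empty sorted list (the function name and the closest-value tie-breaking are assumptions):

def binary_search_closest(arr, target):
    # Assumes arr is a non-empty sorted list. Returns the index of target if
    # present; otherwise the index of the closest value.
    lo, hi = 0, len(arr) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            lo = mid + 1      # discard the left half
        else:
            hi = mid - 1      # discard the right half
    # Not found: arr[hi] < target < arr[lo] (when both indices are valid),
    # so the closest value is one of those two neighbors.
    candidates = [i for i in (hi, lo) if 0 <= i < len(arr)]
    return min(candidates, key=lambda i: abs(arr[i] - target))

print(binary_search_closest([1, 3, 5, 8, 13], 8))  # 3 (exact match)
print(binary_search_closest([1, 3, 5, 8, 13], 9))  # 3 (8 is the closest value to 9)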


If we start the binary search with an input of N elements, then at the next iteration we only need to
search through N/2 elements, and so on. The runtime complexity for binary search is O(log N), since
at each iteration we cut the remaining search space in half. The space complexity is O(N) to hold the
input array itself; beyond that, the search needs only O(1) auxiliary space.

Complexity Analysis Applied to ML Algorithms


For an example of complexity analysis for a machine learning technique, consider the Naive Bayes
classifier. Recall that the algorithm aims to calculate

P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}

for each of the n training data points (of dimension d) for each of the classes, and that P(B|A) is the
likelihood probability and P(A) is the prior probability. In simpler terms, Naive Bayes counts how
many times each of the d features co-occurs with each class.
Now, consider the training runtime. For all n training points, Naive Bayes looks at the likelihood
and prior probabilities over all d features, for all k classes. This takes O(nkd) total runtime, since
the operations boil down to a series of counts. The space complexity is just O(kd) to store the
probabilities needed to compute results for new data points.
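As a rough sketch of why training boils down to counting (the binary 0/1 feature encoding, function name, and list-based storage are assumptions):

def train_naive_bayes(X, y, k):
    # X: list of n feature vectors, each of length d with binary (0/1) features;
    # y: list of n class labels in range(k).
    d = len(X[0])
    class_counts = [0] * k                        # counts for the priors P(A)
    feature_counts = [[0] * d for _ in range(k)]  # O(kd) storage for the likelihood counts
    for features, label in zip(X, y):             # n points ...
        class_counts[label] += 1
        for j, value in enumerate(features):      # ... times d features each
            feature_counts[label][j] += value
    # The counting pass is O(nd) and the normalization below is O(kd),
    # both comfortably within the O(nkd) bound discussed above.
    priors = [count / len(X) for count in class_counts]
    likelihoods = [[feature_counts[c][j] / max(class_counts[c], 1) for j in range(d)]
                   for c in range(k)]
    return priors, likelihoods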
As another example, consider logistic regression. Recall that we need to calculate the following:

S(x) = \frac{1}{1 + e^{-X\beta}}

for any given x. There are n training data points (each with dimension d); hence, β is a d-by-1 vector
of weights. Recall that the goal of logistic regression is to find the optimal decision boundary to split
the data into two classes. This involves multiplying each of the n training points with β, which is a
d-by-1 vector, so the training runtime complexity is O(nd) per pass over the data. The space
complexity is just O(d) to store the weights (β) used to classify new data points.


Data Structures
Below is a brief overview of the most common data structures used for coding interviews. The best
way to become familiar with each data structure is by implementing a basic version of it in your
favorite language. Knowing the Big-O for common operations, like inserting an element or finding an
element within the structure, is also essential. The table below can be used for reference:
Data Structure        Average (Access / Search / Insert / Delete)   Worst (Access / Search / Insert / Delete)   Space (Worst)
Array                 O(1) / O(n) / O(n) / O(n)                     O(1) / O(n) / O(n) / O(n)                   O(n)
Stack                 O(n) / O(n) / O(1) / O(1)                     O(n) / O(n) / O(1) / O(1)                   O(n)
Queue                 O(n) / O(n) / O(1) / O(1)                     O(n) / O(n) / O(1) / O(1)                   O(n)
Linked List           O(n) / O(n) / O(1) / O(1)                     O(n) / O(n) / O(1) / O(1)                   O(n)
Hash Map              N/A / O(1) / O(1) / O(1)                      N/A / O(n) / O(n) / O(n)                    O(n)
Binary Search Tree    O(log n) / O(log n) / O(log n) / O(log n)     O(n) / O(n) / O(n) / O(n)                   O(n)

Arrays
An array is a series of consecutive elements stored sequentially in memory. Arrays are optimal for
accessing elements at particular indices, with an O(1) access and index time. However, they are
slower for searching and deleting a specific value, with an O(N) runtime, unless sorted. An array's
simplicity makes it one of the most commonly used data structures during coding interviews.
Common array interview questions include:
• Moving all the negative elements to one side of an array
• Merging two sorted arrays
• Finding specific sub-sequences of integers within the array, such as the longest consecutive subsequence or the consecutive subsequence with the largest sum
A frequent pattern for array interview questions is the existence of a straightforward brute-force
solution that uses O(n) space, and a more clever solution that uses the array itself to lower the space
complexity down to O(1). Another pattern we've seen when dealing with arrays is the prevalence of
off-by-1 errors — it's easy to crash the program by accidentally reading past the last element of an
array.
For jobs where Python knowledge is important, interviews may cover list comprehensions, due to
their expressiveness and ubiquity in codebases. As an example, below, we use a list comprehension
to create a list of the first 10 positive even numbers. Then, we use another list comprehension to
find the cumulative sum of the first list:
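A sketch matching that description might look like:

# First 10 positive even numbers: [2, 4, 6, ..., 20]
evens = [2 * i for i in range(1, 11)]

# Cumulative sum of that list: [2, 6, 12, 20, 30, 42, 56, 72, 90, 110]
cumulative = [sum(evens[:i + 1]) for i in range(len(evens))]

print(evens)
print(cumulative)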

Arrays are also at the core of linear algebra since vectors are represented as 1-D arrays, and matrices
are represented by 2-D arrays. For example, in machine learning, the feature matrix X can be
represented by a 2-D array, with one dimension as the number of data points (n) and the other as
the number of features (d).


Linked Lists
A linked list is composed of nodes with data that have pointers to other nodes. The first node is
called the head, and the last node is called the tail. Linked lists can be circular, where the tail points
to the head. They can also be doubly linked, where each node has a reference to both the previous
and next nodes. Linked lists are optimal for insertion and deletion, with O(1) insertion time at the
head or tail, but are worse for indexing and searching, with a runtime complexity of O(N) for
indexing and O(N) for search.

Common linked list questions include:


• Reversing a linked list
• Detecting a cycle in a linked list
• Removing duplicates from a sorted linked list
• Checking if a linked list represents a palindrome
As an example, below we reverse a linked list. Said another way, given the input linked list 4 → 1 → 3
→ 2, we want to write a function which returns 2 → 3 → 1 → 4. To implement this, we first start with
a basic node class:

Then we create the LinkedList class, along with the method to reverse its elements. The reverse
function iterates through each node of the linked list. At each step, it does a series of swaps
between the pointers of the current node and its neighbors.
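One way to write the class (the push helper and method names are assumptions):

class LinkedList:
    def __init__(self):
        self.head = None

    def push(self, data):
        # Insert a new node at the head of the list in O(1) time.
        node = Node(data)
        node.next = self.head
        self.head = node

    def reverse(self):
        # Walk the list once, re-pointing each node at its predecessor.
        prev = None
        current = self.head
        while current:
            next_node = current.next   # remember the rest of the list
            current.next = prev        # point the current node backwards
            prev = current
            current = next_node
        self.head = prev               # the old tail becomes the new head

# Pushing 2, 3, 1, 4 at the head builds the list 4 -> 1 -> 3 -> 2.
lst = LinkedList()
for value in [2, 3, 1, 4]:
    lst.push(value)
lst.reverse()  # the list is now 2 -> 3 -> 1 -> 4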


Like array interview questions, linked list problems often have an obvious brute-force solution that
uses O(n) space, but then also a more clever solution that utilizes the existing list nodes to reduce
the memory usage to O(1). Another commonality between array and linked list interview solutions is
the prevalence of off-by-one errors. In the linked list case, it's easy to mishandle pointers for the
head or tail nodes.

Stacks & Queues


A stack is a data structure that allows adding and removing elements in a last-in, first-out (LIFO)
order. This means the element that is added last is the first element to be removed. Another name
for adding and removing elements from a stack is pushing and popping. Stacks are often
implemented using an array or linked list.
A queue is a data structure that allows adding and removing elements in a first-in, first-out (FIFO)
order. Queues are also typically implemented using an array or linked list.

The main difference between a stack and a queue is the removal order: in the stack, there is a LIFO
order, whereas in a queue it's a FIFO order. Stacks are generally used in recursive operations,
whereas queues are used in more iterative processes.
Common stacks and queues interview questions include:
• Writing a parser to evaluate regular expressions (regex)
• Evaluating a math formula using order of operations rules
• Running a breadth-first or depth-first search through a graph
An example interview problem that uses a stack is determining whether a string has balanced
parentheses. Balanced, in this case, means every left-side parenthesis is matched by a right-side
parenthesis of the same type, in the correct order. For instance, the string "({}((){}))" is correctly
balanced, whereas the string "{}() )" is not balanced, due to the last character, ')'. The algorithm
steps are as follows, with a short code sketch after the steps:


1) Starting parentheses (left-sided ones) are pushed onto the stack.


2) Ending parentheses (right-sided ones) are verified to see if they are of the same type as the most
recently seen left-side parentheses on the stack.
3) If the parentheses are of the same type, pop from the stack. If they don't match, return false
since the parentheses are mismatched.
4) Continue parsing until the input is completely processed. If the stack is empty at that point
(every pair of parentheses was correctly accounted for), return true; otherwise, return false.
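A minimal sketch of these steps, assuming only the bracket characters ()[]{} need to be matched:

def is_balanced(s):
    # Returns True if every opening bracket is closed by the matching type, in order.
    pairs = {')': '(', ']': '[', '}': '{'}
    stack = []
    for ch in s:
        if ch in '([{':
            stack.append(ch)                         # step 1: push opening brackets
        elif ch in pairs:
            if not stack or stack.pop() != pairs[ch]:
                return False                         # step 3: mismatch (or nothing to match)
    return not stack                                 # step 4: balanced only if nothing is left over

print(is_balanced("({}((){}))"))  # True
print(is_balanced("{}() )"))      # False, because of the final ')'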

Hash Maps
A hash map stores key-value pairs. For every key, a hash map uses a hash function to compute an
index, which locates the bucket where that key's corresponding value is stored. In Python, a
dictionary offers support for key-value pairs and has the same functionality as a hash map.


While a hash function aims to map each key to a unique index, there will sometimes be "collisions"
where different keys have the same index. In general, when you use a good hash function, expect
the elements to be distributed evenly throughout the hash map. Hence, lookups, insertions, and
deletions for a key take constant time on average.
Due to their optimal runtime properties, hash maps make a frequent appearance in coding interview
questions.
Common hash map questions center around:
• Finding the union or intersection of two lists
• Finding the frequency of each word in a piece of text
• Finding four elements a, b, c, and d in a list such that a + b = c + d
An example interview question that uses a hash map is determining whether an array contains two
elements that sum up to some target value. For instance, say we have the list [3, 1, 4, 2, 6, 9] and a
target of k = 11. In this case, we return true, since 2 and 9 sum to 11.
The brute-force method of solving this problem is to use a double for-loop and sum up every pair of
numbers in the array, which gives an O(N^2) solution. But by using a hash map, we only have to
iterate through the array with a single for-loop. For each element in the loop, we check whether the
complement of the number (target minus that number) already exists in the hash map, achieving an
O(N) solution:
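A minimal sketch of this approach, using Python's built-in set, which, like a dict, is hash-based (using a set rather than a dict is an assumption, since only membership matters here):

def has_pair_with_sum(nums, k):
    # Single pass: O(N) time, O(N) extra space for the values seen so far.
    seen = set()
    for num in nums:
        if k - num in seen:   # the complement was seen earlier in the array
            return True
        seen.add(num)
    return False

print(has_pair_with_sum([3, 1, 4, 2, 6, 9], 11))  # True, since 2 + 9 = 11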
Due to a hash function's ability to efficiently index and map data, hash functions are used in many
real-world applications, particularly in information retrieval and storage. For example, say we need
to spread data across many databases so that the data can be stored and queried
efficiently while distributed. Sharding, covered in depth in the databases chapter, is one way to split
the data. Sharding is commonly implemented by taking the given input data, and then applying a
hash function to determine which specific database shard the data should reside on.

Trees
A tree is a basic data structure with a root node and subtrees of child nodes. The most basic type
of tree is a binary tree, where each node has at most two child nodes. Binary trees can be
implemented with a left and right child node, like below:
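A minimal version of such a TreeNode class (the attribute names are assumptions):

class TreeNode:
    def __init__(self, value):
        self.value = value   # data stored at this node
        self.left = None     # left child, or None
        self.right = None    # right child, or None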


There are various types of traversals and basic operations that can be performed on trees. For
example, in an in-order traversal, we first process the left subtree of a node, then process the
current node, and, finally, process the right subtree:
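A sketch of an in-order traversal over the TreeNode class above (printing each value is an illustrative choice):

def in_order(node):
    # Process the left subtree, then the current node, then the right subtree.
    if node is None:
        return
    in_order(node.left)
    print(node.value)
    in_order(node.right)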

The two other closely related traversals are post-order traversal and pre-order traversal. A simple
way to keep these three algorithms straight is that the "post/pre/in" refers to when the root value is
processed. Hence, a post-order traversal processes the left subtree first, then the right subtree, and,
at the end, the root node. A pre-order traversal processes the root node first, then the left subtree,
and then the right subtree.

Level-order tree traversal: 9, 12, 5, 3, 4, 11, 2, 6, 7, 8
In-order tree traversal: 3, 12, 6, 4, 7, 9, 11, 5, 8, 2
Pre-order tree traversal: 9, 12, 3, 4, 6, 7, 5, 11, 2, 8
Post-order tree traversal: 3, 6, 7, 4, 12, 11, 8, 2, 5, 9

For searching, insertion, and deletion, the worst-case runtime for a binary tree is O(N), where N is
the number of nodes in the tree.
Common tree questions involve writing functions to get various properties of a tree, like the depth
of a tree or the number of leaves in a tree. Oftentimes, tree questions boil down to traversing
through the tree and recursively passing some data in a top-down or a bottom-up manner. Coding
interview problems also often focus on two specific types of trees: Binary Search Trees and Heaps.

Binary Search Trees


A binary search tree (BST) is composed of a series of nodes, where any node in a left subtree is
smaller than or equal to the root, and any node in the right subtree is larger than or equal to the
root. When BSTs are height balanced so no one leaf is much deeper than another leaf from the root,
searching for elements becomes efficient. To demonstrate, consider searching for the value 9 in the
balanced BST below:

Example of a Binary Search Tree


To find 9, we first examine the root value, 8. Since 9 is greater than 8, the node containing 9, if it
exists, would have to be on the right side of the tree. Thus, we've cut the search space in half. Next,
we compare against the node 10. Since 9 is less than 10, the node, should it exist, has to be on the
left of 10. Again, we've cut the search space in half. In conclusion, since 10 doesn't have a left child,
we know 9 doesn't occur in the tree. By cutting the search space in half at each iteration, BSTs
support search, insertion, and deletion in O(log N) runtime.
Because of their lookup efficiency, BSTs show up frequently not just in coding interviews but in real-
life applications. For instance, B-trees, which are used universally in database indexing, are a
generalized version of BSTs. That is, they allow each node to have more than two children (up to M
children), but offer a searching and insertion process similar to that of a BST. These properties allow
B-trees to have O(log N) lookup and insertion runtimes similar to those of BSTs, where N is the total
number of nodes in the B-tree. Because of the logarithmic growth of the tree depth, database
indexes with millions of records often have a B-tree depth of only four or five layers.

Example of a B-Tree
Common BST questions cover:
• Testing if a binary tree has the BST property
• Finding the k-th largest element in a BST
• Finding the lowest common ancestor between two nodes (the closest common node to two input nodes such that both input nodes are descendants of that node)
An example implementation of a BST using the TreeNode class, with an insert function, is as follows:
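One possible sketch of an insert function, written here as a standalone recursive function over the TreeNode class above (that style is an assumption):

def insert(root, value):
    # Insert value into the BST rooted at root and return the (possibly new) root.
    if root is None:
        return TreeNode(value)
    if value <= root.value:
        root.left = insert(root.left, value)    # smaller-or-equal values go to the left subtree
    else:
        root.right = insert(root.right, value)  # larger values go to the right subtree
    return root

# Hypothetical usage: build a small BST from a list of values.
root = None
for v in [8, 3, 10, 1, 6, 14]:
    root = insert(root, v)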


Heaps
Another common tree data structure is a heap. A max-heap is a type of heap where each parent
node is greater than or equal to any child node. As such, the largest value in a max-heap is the root
value of the tree, which can be looked up in O(1) time. Similarly, for a min-heap, each parent node is
smaller than or equal to any child node, and the smallest value lies at the root of the tree and can be
accessed in constant time.

Heap Data Structure: Minimum Heap and Maximum Heap

To maintain the heap property, there is a sequence of operations known as "heapify," whereby
values are "bubbled up/down" within the tree based on what value is being inserted or deleted. For
example, say we are inserting a new value into a min-heap. This value starts at the bottom of the
heap and then is swapped with its parent node ("bubbled up") until it is no longer smaller than its
parent (in the case of a min-heap). The runtime of this heapify operation is the height of the tree,
O(log N).
In terms of runtime, inserting or deleting is O(log N), because the heapify operation runs to maintain
the heap property. The search runtime is O(N) since every node may need to be checked in the
worst-case scenario. As mentioned earlier, heaps are optimal for accessing the min or max value
because they are at the root, i.e., O(1) lookup time. Thus, consider using heaps when you care
mostly about finding the min or max value and don't need fast lookups or deletes of arbitrary
elements. Commonly asked heap interview questions include:
• Finding the K largest or smallest elements within an array
• Finding the current median value in a stream of numbers
• Sorting an almost-sorted array (where elements are just a few places off from their correct spot)
To demonstrate the use of heaps, below we find the k-largest elements in a list, using the heapq
package in Python:
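A minimal sketch using heapq.nlargest:

import heapq

def k_largest(nums, k):
    # heapq.nlargest maintains a size-k heap internally: roughly O(N log k) time.
    return heapq.nlargest(k, nums)

print(k_largest([5, 1, 9, 3, 7, 6, 2], 3))  # [9, 7, 6]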

