ACE THE DATA SCIENCE INTERVIEW
All rights reserved. No part of this book may be used or reproduced in any manner without written permission except in the case of brief quotations in critical articles or reviews.
The authors and/or copyright holders assume no responsibility for the loss or damage caused, or allegedly caused, directly or indirectly, by the use of information contained in this book. The authors specifically disclaim any liability incurred from the use or application of the contents of this book.
Throughout this book, trademarked names are referenced. Rather than using a trademark symbol with every occurrence of a trademarked name, we state that we are using the names in an editorial fashion only and to the benefit of the trademark owner, with no intention of infringement of the trademark.
"Super helpful career advice on breaking into data, and landing your first job in the field.”
— Prithika Hariharan, President of Waterloo Data Science Club
Data Science Intern, Wish
"Solving the 201 interview questions is helpful for people in ALL industries, not just tech!”
— Lars Hulstaert, Senior Data Scientist, Johnson & Johnson
"Whal I found most compelling was the love story that unfolds through the book. From
the first date to the data science interview, Ace reveals his true character and what
follows is incredible. I 'm thrilled by their avant garde style that uses career
advice as a vehicle for fictional narrative, Once you pick up on it, you feel
as though you 're in on a secret even the authors weren’t aware of! "
—Naveen lyer, Former Machine Learning Engineer, Instagram
"Covers pretty much every topic I 've been tested on during data science
& analytics interviews.”
— Jeffrey Ugochukwu, Dala Analyst Intern, Federal Reserve; UC Davis Statistics' 23
"Nick, Kevin, and this book have been extremely helpful resources
as
I navigate my way into the world of data science.”
— Tania Dawood, USC MS in Communications Data Science '23
"Excellent practice to keep yourself sharp for Wall Street quant and
data science interviews!"
— Mayank Mahajan, Data Scientist, Blackstone
"The authors did an amazing job presenting the frameworks for solving
practical case study interview questions in simple, digestible terms.”
— Rayan Roy, University of Waterloo Statistics '23
"Navigates the intricacies of data science interviews without getting lost in them.”
— Sourabh Shekhar, Former Senior Data Scientist,
Neustar & American Express
For my family: Bin, Jing, Matt, and Allen
~ Kevin
For Mom and Dad, Priya and Dev; and my brother, Naman—
My family, who supports me in every endeavor
~ Nick
Table of Contents
Introduction………………………………………………………………………………………………………………………….. vii-x
Introduction
Data scientists are not only privileged to be solving some of the most intellectually stimulating and
impactful problems in the world – they’re getting paid very well for it too. At Facebook, the median
total compensation reported by Levels.fyi for a Senior Data Scientist is a whopping $253,000 per
year. According to data on Glassdoor, the median base salary for a Quantitative researcher at hedge
fund Two Sigma is $173,000 per year, with opportunities to DOUBLE take-home pay thanks to
generous performance bonuses.
Given how intellectually stimulating and downright lucrative data science is, it shouldn't be a surprise that competition for these top data jobs is fierce. Between "entry-level" positions in data science weirdly expecting multiple years of experience, and these entry-level jobs themselves being relatively rare in a field saturated with Ph.D. holders, early-career data scientists face hurdles in even landing interviews at many firms.
Worse, job seekers at all experience levels face obstacles with online applications, likely never hearing back from most jobs they apply to. Sometimes, this is due to undiagnosed weaknesses in a data scientist's resume, causing recruiters to pass on talented candidates. But often, it's simply because candidates aren't able to stand out from the sea of candidates an online job application attracts. Forget about acing the data science interview: given the number of job-hunting challenges a data scientist faces, just getting an interview at a top firm can be considered an achievement in itself.
Then there's the question of actually passing the rigorous technical interviews. In an effort to minimize false positives (aka "bad hires"), top companies run everyone from interns to industry veterans through tough technical challenges to filter out weak candidates. These interviews cover a lot of topics because the data scientist role is itself so nebulous and varied: what one company calls a data scientist, another company might call a data analyst, data engineer, or machine learning engineer. Only after passing these onerous technical interviews, often three or four on the same day, can you land your dream job in data science.
We know this all must sound daunting. Spoiler alert: it is!
The good news is that in this book we teach you exactly how to navigate the data science job search so that you can land more interviews in the first place. We've put together a shortlist of the most essential topics to brush up on as you prepare for your interviews so that you can ace these tough technical questions. Most importantly, to put your technical skills to the test, we included 201 interview questions from real data scientist interviews. By solving actual problems from FANG companies, Silicon Valley darlings like Airbnb and Robinhood, and Wall Street firms like Two Sigma and Citadel, we're confident our book will prepare you to ace the data science interview and help you land your dream job in data.
Who are we?
Who are we, and how'd we find ourselves writing this book?
I (Nick) have worked in various data-related roles. My first internship was at a defense contractor,
CCRi, where I did data science work for the U.S. Intelligence Community. Later in college, I interned
as a software engineer at Microsoft and at Google's Nest Labs, doing data infrastructure engineering.
After graduating from the University of Virginia with a degree in systems engineering, I started my
full-time career as a new grad software engineer on Facebook's Growth team. There, I implemented
features and ran A/B tests to boost new user retention.
After Facebook, I found myself hungry to learn the business side of data, so I joined geospatial analytics startup SafeGraph as their first marketing hire. There I helped data science and machine learning teams at Fortune 500 retailers, hedge funds, and ad-tech startups learn about SafeGraph's
location analytics datasets.
On the side, I started to write about my career journey, and all the lessons I learned from being both
a job applicant and an interviewer. Ten million views on LinkedIn later, it's obvious the advice struck
a nerve. From posting on LinkedIn, and sending emails to my tech careers newsletter with 45,000
subscribers, I've been privileged to meet and help thousands of technical folks along their career
journey. But, there was a BIG problem.
As a mentor, I was able to point software engineers and product managers to many resources for
interview prep and career guidance, like Cracking the Coding Interview, Cracking the PM Interview
and LeetCode. But, from helping data scientists, I realized just how confusing the data science
interview process was, and how little quality material was out there to help people land jobs in data.
I reached out to Kevin to see if he felt the same way.
You might be wondering how Kevin and I teamed up (and why the result was a book rather than a mixtape).
Because my rapping skills paled in comparison to Kevin's, and I can't sing worth a damn (even though my last name is Singh), it made sense to delay the mixtape and instead focus on our other shared passion: helping people navigate the data science job hunt.
Kevin has successfully landed multiple offers in the data world. It started in college, when he interned on the Ad Fraud team at Facebook. After graduating from the University of Pennsylvania with a major in computer science, and a degree in business from Wharton, Kevin started his career as a data scientist at Facebook, where he worked on reducing bad actors and harmful content on the Facebook Groups platform. After a year, Wall Street came calling. Kevin currently works as a data scientist at a hedge fund in New York.
On the side, Kevin combined his passion for data science and helping people, which led him to found DataScienceprep.com, become a course creator on DataCamp, and coach dozens of people in their data science careers.
Ace the Data Science Interview results from Kevin's and my experience working in Silicon Valley and on Wall Street, the insights we've garnered from networking with recruiters and data science managers, our personal experience coaching hundreds of data scientists to land their dream role, and our shared frustration with the career and interview guidance that's been available to data scientists — that is, until now!
Also, make sure you've subscribed to Nick's monthly career advice email newsletter:
nicksingh.com/signup
It's just one email a month with the latest tips, resources, and guides to help you excel in your
technical career.
And speaking of email, if you have suggestions, find any mistakes, have success stories to share, or
just want to say hello, send us an email: [email protected] or feel free to
connect with us on social media.
Nick Singh
nicksingh.com — where I blog my long-form essays and career guides
Linkedin.com/in/Nipun-Singh — where I share career advice daily (please send a connection
request with a message that you've got the book. I'm close to the 30k connection limit so don't
want to miss your connection request!)
instagram.com/DJLilSingh — for a glimpse into my hobbies (DJing and being Drake's #1 Stan)
twitter.com/NipunFSingh — for tweets on careers and tech startups
Kevin Huo
linkedin.com/in/Kevin-Huo
instagram.com/Kwhuo
CHAPTER 1
4 Resume Principles to Live By for Data Scientists
Before you kick off the job hunt, get your resume in order. Time and effort spent here can pay rich dividends later on in the process. No one is going to grant you an interview if your resume doesn't scream success and competence. So here are four principles your resume should live by, along with miscellaneous hacks to level up. We even include our actual resumes from our senior years of college to show you how, in real life, we practiced what we preach to land our dream jobs at Facebook.
Since the main aim of your resume is to land an interview, the best way to do that is by highlighting a few of your best achievements. It's crucial that these highlights are as easy and as obvious as possible to find so that the person reading your resume decides to grant you an interview. Recruiters are busy people who often have only 10 seconds or less to review your resume and decide whether to give you an interview. Keeping the resume short and removing fluff is key to making sure your highlights shine through in the short timespan when a recruiter is evaluating your resume.
One way to shorten the resume and keep it focused on the highlights is by omitting non-relevant
jobs. Google doesn't care that you were a lifeguard three summers ago. The exception to this rule is
if you have zero job experience. If that's the case, do include the job on your resume to prove that
you've held a position that required a modicum of "adulting.”
Another way to make sure your resume lands you an interview is by tailoring the resume to the specific job and company. You want the recruiter reading your resume to think you are a perfect fit for the position you're gunning for. For example, say you worked part-time as a social media assistant three years ago promoting your aunt's restaurant. This experience isn't relevant for most data science roles and can be left off your resume. But if you're applying to Facebook and interested in ads, it's worth including. If you're applying to be a data analyst on a marketing analytics team, it can also help to leave in the social media marketing job.
Another resume customization example: when I (Nick) applied to government contractors with my
foreign-sounding legal name (Nipun Singh), I put "U.S. Citizen" at the top of my resume. This way, the
recruiter knew I had the proper background to work on sensitive projects, and could be eligible later
for a top secret security clearance.
Principle #3: Only Include Things That Make You Look Good
Your resume should make you shine, so don't be humble or play it cool. If you deserve credit for
something, put it on your resume. But never lie or overstate the truth. It’s easy for a recruiter or
other company employee to chat with you about projects and quickly determine if you did it all or if
it was a three-person group project you're passing off as your own. And technical interviewers love
to ask probing questions about projects, so don't overstate your contributions.
Another way to look good is to not volunteer negative information. Sounds obvious enough, but I've seen people make this mistake often. For example, you don't have to list your GPA! Only write down your GPA if it helps. A 3.2 at MIT might be okay to list, but a 3.2 at a local lesser-known college might not be worth listing if you are applying to Google, especially if Google doesn't usually recruit at your college. Why? Because Google being Google might be expecting you to be at the top of your class with a 4.0 coming from a non-target university; as a result, a 3.2 might look bad.
Another reason to get rid of the skills section: remember "Show and Tell" in grade school? Well, it's
still way better to show rather than just to tell! Include the technologies inline with your past work
experience or portfolio projects. Listing the tech stack this way contextualizes the work you did and
shows an interviewer what you're able to achieve with different tools. Plus, in explaining your
projects and past work experiences, you'll have enough keywords covered to appease the ATS which
traditional career advice is overly focused on.
The last reason to ditch a big skills and technologies section is that you are expected to learn new
languages and frameworks quickly at most companies. The specific tools you already know are
helpful but not crucial to landing a job. You are expected to be an expert at data science in general,
not in specific tools. Plus, at large tech companies like Facebook and Google, the tools and systems are often proprietary. Thus, the specific tools you know don't matter much; what matters is what you've actually accomplished with those tools in the past.
For example, back in college, I (Nick) made the size of the companies I worked at bigger than other
text. This way, while scanning my resume, a recruiter could quickly see and say, "Microsoft Intern.
Check. Google Intern. Check. Okay, let's interview this kid." I also bolded the user metrics for
RapStock.io on my resume. I wanted to quickly call out this information because it was sandwiched
between additional details about my projects.
Another resume convention you can break to your favor is section order. There is no hard and fast
rule on ordering your education, work experience, projects, and skills sections. Just remember: in
English, we scan from top to bottom. So list your best things at the top of the resume.
Keeping what's most important up top is an important piece of advice to remember, since it may conflict with advice from many college career advisors who suggest listing your university at the top. For example, if you currently attend a small, unknown university but interned at Uber this summer, don't be afraid to list your work experience with Uber at the top and move education to the bottom. Went to MIT but have no relevant industry internships yet? Then it's fine to lead with your education, and not the work experience section.
Another resume rule on ordering you can break: items within a section don't have to be in chronological order! For example, I had a friend who interned at Google one summer, then interned part-time at a small local startup later that fall. It's okay to keep Google at the top of the resume, ahead of the small local startup, even though the startup experience was the more recent work experience. List first what makes you look the best.
Another convention you can safely break is keeping standard margins and spacing. Change margins
to your favor, to give yourself more space and breathing room. You can also use different font sizes
and spacing to emphasize other parts of your resume. Just don't use margin changes to double the
content on your one-page resume. If you do, you're likely drowning out your best content, which is a
big no-no (as mentioned in Principle #3).
Oh, and speaking of breaking the resume rules: you can ignore any of the tips I've listed earlier in this chapter if it helps you tell a better story on your resume. Earlier, I mentioned that listing irrelevant jobs — like being a waiter — won't help you land a data science gig. But go ahead and list it if, for example, you were a waiter who then built a project that analyzed the data behind restaurant food waste. Similarly, there's nothing wrong with listing the waiter position if you're applying to DoorDash or Uber Eats. If listing something helps tell the story of you, leave it in.
So don't listen to folks who tell you that linking to your SoundCloud from your resume is unprofessional. Suppose you've made a hackathon project around music, or are applying to a company like Spotify. In that case, it's perfectly fine to list the SoundCloud link, since it shows a recruiter that you followed your passions and created projects to further your interests. And by the way, if your mixtape is fire, please email us the link to [email protected] and we'll give it a listen.
And recruiters are compensated based on the number of candidates they can close. So, put yourself
in a recruiter's chair. Let's say you're a Silicon Valley-based company recruiter with two identically-
skilled candidates, but one lives in the Bay Area and the other lives in NYC. Which of the two are you more likely to close? The NYC candidate who needs to decide to move to SF and uproot her family before accepting your offer, or the local person who can take your offer and start next week?
Don't List Your Phone Number Unless It's Local
Because of the spam robocall epidemic, anyone who calls you will email you first to ask for your
phone number and set up a time. So, there's no need to list it. Remember: you have 10 seconds to
rivet someone reading your resume. Don't waste a second of their time by presenting nonessential
information. Plus, if your phone number is international, it'll hurt even more, as often there's a bias
to hire local candidates. And yes, hiring managers and recruiters do notice when your number is
from a far-flung area code.
And lastly, since you're carrying your resume everywhere, get a folio to carry it. Crumpled resumes
look unprofessional. If you are in college or are a new grad, and went to a top school, get a leather
padfolio with the school logo front and center for the subtle flex.
Experience:_________________________________________________________________
Google/Nest Labs, Software Engineering Intern May-Aug 2016
On the Data Infrastructure team, built a monitoring & deployment tool for GCP Dataflow
jobs in Python (Django)
Wrote Spark jobs in Scala to take Avro-formatted data from HDFS and publish it to Google's Pub/Sub service
Microsoft, Software Engineering Intern May-Aug 2015
Reduced latency from 45 seconds to 80 milliseconds for a new monitoring dashboard for
payments team
Did the above by developing an efficient ASP.NET Web API in C# which leveraged caching and pre-processing of payment data queried from Cosmos via Scope (Microsoft-internal versions of Hadoop File System and Hive)
CCRi, Data Science Intern Jun-Aug 2014
Worked on an NLP algorithm for a contract with the Office of Naval Research
Improved the algorithm's F1 measure by 70% compared to the original geo-location algorithm used by Northrop Grumman, by designing a new algorithm in Scala which used the Stanford NLP package to geo-locate events in the news
Projects:____________________________________________________________________
Founder, RapStock.io Jan-May 2015
Grew site to 2,000 Monthly Active Users, and received 150,000 page views
Developed using Python (Django), d3.js, jQuery, Bootstrap, PostgreSQL, and deployed to
Heroku
Game similar to fantasy football, but players bet on the real-world popularity and commercial success of rappers; the rappers' performance is based on metrics scraped from Spotify and Billboard
"Great to see that folks stick around" - Alexis Ohanian, Founder of Reddit, commenting
on our retention metrics
Kevin Huo
EDUCATION
University of Pennsylvania - Philadelphia, PA Graduating: May 2017
The Wharton School: BS in Economics with concentrations in Statistics & Finance
School of Engineering and Applied Sciences: BSE in Computer Science
GPA: 3.65/4.00
Honors: Dean's List (2013-2014), PennApps Health Hack Award (2014)
Statistics Coursework: Modern Data Mining, Statistical Inference, Stochastic Process, Probability,
Applied Probability Modeling
Teaching Assistant (2014-2016)
Held weekly recitation and office hours and responsible for grading of homework, tests, and
quizzes
Discrete Math (Spring & Fall 2014), Data Structures & Algorithms (Spring/Fall 2015 & 2016)
LANGUAGES/FRAMEWORKS
Proficient: Python, R, SQL, Java; Familiar: JavaScript, HTML/CSS; Basic: OCaml, Hadoop, Linux
Unanimously, data science hiring managers have told us that not having portfolio projects
was a big red flag on a candidate's application. This holds true especially for college students
or people new to the industry, who have more to prove. From mentoring many data
scientists, we've found that having kick-ass portfolio projects was one of the best ways to
stand out in the job hunt. And from our own experience, we know that creating portfolio
projects is a great way to apply classroom knowledge to real-world problems in order to get
some practical experience under your belt. Whichever way you slice it, creating portfolio
projects is a smart move. In this chapter, you'll learn 5 tips to level up your data science and
machine learning projects so that recruiters and hiring managers are jumping at the chance
to interview you. We teach you how to create, position, and market your data science
project. When done right, these projects will give you something engaging to discuss during
your behavioral interviews. Plus, they'll help make sure your cold emails get answered
(Chapter 3).
The majority of recruiters just read the project description for 10 seconds in the cold email you send
them or when reviewing your resume. Maybe — if you're lucky — they click a link to look at a
graphic or demo of the project. At this point, usually in under 30 seconds, they think to themselves,
"This is neat and relevant to the job description at hand," and decide to give you an interview.
Thus, we're optimizing our data science portfolio projects to impress the decision-maker in this
process — the busy recruiter. We're optimizing for projects that are easily explainable via email.
We're optimizing for ideas that are "tweetable": ones whose essence can be conveyed in 140
characters or less. By having this focus from day one when you kick off the portfolio project, you will
skyrocket your chances of ending up with a "kick-ass" portfolio project that gets recruiters hooked.
Don’t worry if you think that focusing on the recruiter will cheapen your portfolio project's technical
merits. Believe us: the technical hiring manager and senior data scientists interviewing you will also
appreciate how neatly packaged and easily understandable your project is. And following our tips
won't stop you from making the project technically impressive; an interesting and understandable
project does not need to come at the expense of demonstrating strong technical data science skills.
Tip #1: Pick a Project Idea That Makes for an Interesting Story
Recruiters and hiring managers are human. Human beings love to hear and think in terms of stories.
You can read the book that's quickly become a Silicon Valley favorite, Sapiens: A Brief History of
Humankind, by Yuval Harari, to understand how fundamental storytelling is to our success as a
species. In the book, Harari argues that it's through the shared stories we tell each other that Homo
sapiens are able to cooperate on a global scale. We are evolutionarily hardwired to listen, remember,
and tell stories. So do yourself a favor and pick ideas to work on which help you tell a powerful story.
A powerful story comes from making sure there is a buildup, then some conflict, and a nice,
satisfying resolution to said conflict. To apply the elements of a story to a portfolio project, make
sure your work has some introductory exploratory data analysis that builds up context around what
you are making and why. Then pose a hypothesis, which is akin to a conflict. Finally, share the verdict
of your posed hypothesis to resolve the conflict you posed earlier. By structuring your portfolio like a
story, it'll be easier to talk more eloquently about your project in an interview. Plus, the interviewer
is hardwired to be more interested — and therefore more likely to remember you and your project
— when you tell it in a format that we're hardwired to love.
So, how do you discover projects that will translate into captivating stories?
Looking at trending stories in the news is a great starting point because they are popular topics that
are easy to storytell around. For example, in the fall of 2020, the biggest news stories were the
COVID-19 pandemic and the 2020 U.S. presidential election. Interesting projects on these topics
could be to look at vaccination rates by zip code for other diseases, and see how they correlate to
demographic factors in order to understand healthcare inequities and complications with vaccine
rollout plans. For the 2020 U.S. presidential election, an interesting project would be to see what
demographic factors correlate highest for a county flipping from Donald Trump in 2016 to Joe Biden
in 2020, and then predicting which counties are the most flippable for future elections.
If you ever get stuck on these newsworthy topics, data journalism teams at major outlets like the
New York Times and FiveThirtyEight have already made a whole host of visualizations related to
these issues. These can serve as inspiration or as a jumping-off point for more granular analysis.
Another easy source of ideas with good story potential is to think about problems you personally
face. You'll have a great story to tell where you're positioned as the hero problem-solver if you can
convey how annoying a problem was to you and that you needed to solve it for yourself and other
sufferers. I've seen friends at hackathons tackle projects on mental health (something they
personally struggled with), resulting in a very powerful and moving narrative to accompany the
technical demo.
Tip #2: Pick a Project Idea That Visualizes Well
A picture is worth a thousand words. And a GIF is worth a thousand pictures. So go with a portfolio
project idea that visualizes well to stand out to recruiters. Ideally, make a cool GIF, image, or
interactive tool that summarizes your results.
I (Nick) saw the power of a catchy visualization firsthand at the last company I worked at, SafeGraph,
when we launched a new geospatial dataset. When we just wrote a blog post and put it on
SafeGraph's Twitter, we wouldn't get much engagement. But when we included a GIF of the
new dataset visualized, we'd get way more attention.
This phenomenon wasn't just isolated to social media — the power of catchy photos and GIFs even
extended to cold email. When we'd send sales emails with a GIF embedded at the top, we got much
higher engagement than when we'd send boring emails that only contained text to announce a
product. These marketing lessons apply to your data science portfolio projects as well, as you should
be emailing your work to hiring managers and recruiters (covered in detail in Chapter 3). You might
be thinking. "Why are we wasting time on this and not focusing on the complicated technical skills
that a portfolio project should demo?"
We want to remind you: your ability to convey results succinctly and accurately is a very real skill.
Explaining your work and presenting it well is a great signal to companies, because real-world data
science means convincing stakeholders and decision makers to go down a certain path. Fancy
models are great, but only if you can easily explain their results and business impact to higher-ups. A compelling visual is one of the easiest ways to accomplish that goal in the business world.
Demonstrating this ability through a portfolio project gives any interviewer confidence you'll be able
to excel at this aspect of data science when actually on the job.
By working on passion projects, you make companies want to invest in you over a more senior candidate who might have more technical skills but lacks the same interest in the field.
What does this advice mean in practice? If you love basketball, then use datasets from the NBA in
your portfolio projects. Passionate about music? Classify songs into genres based on their lyrics.
Binge Netflix shows? Take the IMDb movies dataset and make your own movie recommender algorithm.
For example, I (Nick) — a passionate hip hop music fan and DJ that's always on the hunt for
upcoming artists and new music — made RapStock.io, a platform to bet on upcoming rappers. When
talking about the project to recruiters, it was effortless for me to come across as passionate about
data science and pricing algorithms because the underlying passion for hip hop music was shining
through.
Another benefit of working on a project related to your passion: it's less of a chore to get the damn
project over the finish line when work becomes play. And getting the project done is paramount to
your success, as we later detail in tip #5.
Relying on the same overused Kaggle datasets as everyone else risks hurting your project's uniqueness. There are also some lost learning opportunities, since collecting and cleaning data is a big part of a data scientist's work. However, if you find something you really love on Kaggle, it's not a big problem. Go for it, and maybe later find a different, complementary dataset to add another dimension to your project.
One way to tackle interesting datasets that are unique is to scrape the data yourself. Packages like Beautiful Soup and Scrapy in Python can help, or rvest for R users. Plus, it's an excellent way to practice your programming skills and also show how scrappy you are. And since collecting and cleaning data is such a large part of a real data scientist's workflow, scraping your own dataset and cleaning it up shows a hiring manager you're familiar with the whole life cycle of a data science project.
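If you want a concrete starting point, here's a minimal scraping sketch in Python using requests and Beautiful Soup. The URL and the "headline" CSS class are hypothetical placeholders rather than a real site's structure; adapt them to whatever you're actually scraping (and check the site's terms of service and robots.txt first).

    # Minimal scraping sketch: fetch a page, pull out text, save the raw scrape.
    import csv

    import requests
    from bs4 import BeautifulSoup

    # Placeholder URL; swap in the page you actually want to scrape
    response = requests.get("https://example.com/articles", timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")

    # Pull the text of every element with a hypothetical "headline" class
    headlines = [tag.get_text(strip=True) for tag in soup.find_all(class_="headline")]

    # Save the raw scrape so your cleaning step is reproducible later
    with open("headlines.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerows([[h] for h in headlines])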
Tip #5: Done > Perfect. Prove You Are Done
As long as your work is approximately correct, the actual technical details don't matter as much for
getting an interview. Again, as mentioned above, a recruiter will not dig into your project and notice
that you didn't remove some outliers from the data. However, a recruiter can quickly determine how
complete a project is! So make sure you go the extra mile in "wrapping up a project." See if you can
"productionize" the project.
Turn the data science analysis into a product. For example, if your project was training a classifier to
predict age from a picture of a face, go the extra step and stand up a web app that allows anyone to upload a photo and predict their own age. As part two of the project, use a neural net to transform
the person's face to a different age, similar to FaceApp. Putting in this extra work, and then cold-
emailing the project to hiring managers, could be your ticket into companies like Snapchat,
Instagram, and TikTok.
If your project was less productizable and more exploratory, see if you can make an interactive
visualization that helps you tell a story and share your results. For example, let's say you did an
exploratory data analysis on the relationship between median neighborhood income and quality of
school district. To wrap this project up, try to make and host an interactive map visualization so that
folks can explore and visualize the data for themselves. I like D3.js for interactive charts and Leaflet.js for interactive maps. rCharts is also pretty cool for R users. By creating a visualization, and then sending this completed interactive map to hiring managers at Zillow or Opendoor, you'll be able to stand out from other candidates.
Lastly, your portfolio project isn't done until it's public, so make sure you publicly share your code on
GitHub. You can also use Google Colab to host and share your interactive data analysis notebook.
Even if no one sees the code or runs the notebook (which is likely!), just having a link to it sends a
signal that you are proud enough of your work to publish it openly. It also shows that you actually
did what you said you did and didn't just fabricate something to pad the resume.
If you can't point to exact metrics like dollars earned or time saved by creating the project, you can instead put down usage numbers as a proxy for the amount of value you created for people. Plus, mentioning view counts or downloads or active users helps demonstrate to a business that you drove a project to completion. It's okay to skip out on demonstrating business value IF you work with interesting enough data and can tell a good story. An example project we find interesting, creative, and fun, but technically simple and not obviously a driver of real business value: A Highly Scientific Analysis
You've crafted the perfect resume and made a kick-ass portfolio project, which means it's time to apply to an open data science position. Eagerly, you go to an online job portal and submit your resume, and maybe even a cover letter. And then it's crickets. Not even an automated rejection letter.
If you've applied online and then been effectively ghosted, you're not alone. Kevin and I have been in the same situation plenty of times. We are all too familiar with the black hole effect of online job applications, where it almost feels like you're tossing your resume into the void.
So how do you reliably land interviews, especially if you have no connections or referrals?
Two words: Cold. Emails.
While in college, Snapchat and Cloudflare interviewed me (Nick) when I had no connections or
referrals at those companies. I got these interviews by writing an email, out of the blue, to the
company's recruiters. This process is known as cold emailing (in contrast to getting a warm
introduction to a recruiter). Even my previous job at data startup SafeGraph is the result of a cold
email I sent to the CEO. We firmly believe this tactic can be a game changer on the data science job hunt.
Of course, cold emails work best when backed by a strong resume (Chapter 1) and strong portfolio projects (Chapter 2). But if you've got your ducks in a row, yet still struggle to land that first interview, this chapter will be a game changer.
Who are we even cold emailing?
Before we talk about the content of the cold email, let's cover who we're reaching out to in the first
place.
At smaller companies with fewer than 50 employees, emailing the CEO or CTO works very well. At mid-range companies (between 50 and 250 people), see if there is a technical recruiter to email;
otherwise just a normal recruiter should do. Another option is emailing the hiring manager for the
team you want to join.
For larger companies, finding the right person can be trickier. If you are looking for internships or are
a new grad, many of the larger companies (1,000+ employees) have a person titled "University
Recruiter" or "Campus Recruiter." Reaching out to these recruiters is how I (Nick) had the most luck
when cold emailing in college.
At very large companies like FANG, there should also be dedicated recruiters only working with data
scientists. To find these recruiters, go to the company's LinkedIn page and hit "employees." Then,
filter the search results by title and search for "Data Recruiter." When doing this at Google, I found
six relevant data science recruiters to reach out to.
Another option is to just filter the search by "recruiter." You’ll get hundreds of results that you can
sift through manually. Doing so at Google uncovered an "ML Recruiter," a "PhD (Data Science) Recruiter," and a "Lead Recruiter, Databases & Data Analytics (GCP)," all in just a few minutes.
Another good source of people to email at a company is alumni from your school who work there.
Even if they work in a non-data science role, they may be able to refer you or know the right person
to connect you with. To find these people, search your university on LinkedIn and click "alumni."
From there, you can filter the alumni profiles based on what companies they work at or what titles
they hold. I resort to this tactic if my first few cold emails to hiring managers and recruiters go
unanswered.
I have an upcoming onsite-interview with Microsoft's Azure ML team next month, but wanted to
also interview with Uber because self-driving cars is where I believe Computer Vision will help
improve the world the most in the next decade.
Helping the world through CV became my passion after seeing the impact of the last project I made, which used CV to find and categorize skin diseases.
From reading the ATG engineering blogs, I know Uber is the best place for a passionate computer vision engineer to make an impact, and I am eager to start the interview process before I go too far down the process with Microsoft.
One warning: be careful not to make it seem like the company you are talking to is the backup
option. To do this, make sure you convey enthusiasm for the company and mission. Below is an
example of that.
Next up: the subject line. To make the subject line click-worthy, it's key to include your most noteworthy and relevant details. It's okay if the subject line is keyword-driven and a "big flex," as the teens say these days.
Borrow from BuzzFeed clickbait titles — they actually work! I wouldn't go so far as to say "21 Weird
Facts That'll Leave You DYING to Hire This Data Scientist," but you get the gist.
What I (Nick) used, for example: "Former Google & Microsoft Intern Interested in FT @ X"
This subject line works because I lead immediately with my background, which is click-worthy since
Google and Microsoft are well-known companies, and I have my specific ask (for full-time software
jobs) included in the subject line. Some other subject line examples that are short and to the point, if you can't rely on internship experience at name-brand companies:
“Computer Vision Ph.D. Interested In Waymo”
“Princeton Math Major Interested in Quant @ Goldman Sachs”
“Kaggle Champion Interested in Airbnb DS”
“UMich Junior & Past GE Intern Seeking Ford Data Science FT”
If I (Nick) found a recruiter from my alma mater (UVA), I'd be sure to include that in the subject line
to show that it's personalized. For reaching out to UVA alumni, I'd throw in a "Wahoowa" (similarly,
a "Go Bears" or "Roll Tide" if you went to Berkeley or Alabama, respectively). Including the name of
the recruiter should also increase the click-through rate.
Example: "Dan I FinTech Hackathon Winner And Wahoo Interested in Robinhood"
Another hack: including "Re:" in the subject line to make it look like they've already engaged in
conversation with you.
Time your cold emails to arrive during the recipient's business hours.
I had the best luck emailing Silicon Valley recruiters at around 11 A.M. or 2 P.M. PST. The psychology behind this is that we've all felt ourselves counting down the minutes to lunch, aimlessly refreshing our email and Slack to pass the time. That's a great time to catch someone. Same with the after-lunch
lull. The best days I found to send emails were Tuesday through Thursday. I avoided Mondays since
that's the day many people have 1:1s or team meetings or have work they are catching up on from
the weekend. On Fridays, many people might be on PTO, or even if they are in office, have some
other kind of event like happy hour in the afternoon (or they've already mentally checked out before
the weekend).
My Cold Email to a Periscope Data Recruiter
Hi X,
Found your email on Hacker News. I'm a former Software Engineering Intern @ Google's Nest Labs and Microsoft who will be graduating from college May '17.
I'm interested in working full time at Periscope Data because of my interest in data engineering
(spent the summer on the Data Infrastructure team @ Nest) and my interest in turning data into
insights (built dashboards to do just that the past two summers).
My Two Follow-Ups
Just wanted to follow up with you about full-time roles at Periscope Data. I believe my interest in data engineering, along with past experience building dashboards and visualization tools, makes me a good fit.
***
Hi X,
Just wanted to follow up with you regarding opportunities with Periscope Data. I will be in the Bay Area doing interviews with Facebook and Uber next week. Would love a chance to do a phone interview with Periscope Data this week to assess technical fit. If we are a good match, I'd be happy to swing by the office the following week for technical interviews while I am already in town.
Thanks,
Nick Singh
I just wanted to reach out to you about the new grad software engineering position @ Airbnb. My friend Y interned at Airbnb on the infrastructure team and really loved their experience. This past summer, I was on the data infrastructure team at Google's Nest Labs. From talking to Y, I think I'd be a good fit for a similar team at Airbnb. Let me know what the next steps are.
Thanks,
Nick Singh
Cold Email to Reddit in 2015
Hello X,
I saw your post on Hacker News and wanted to reach out regarding why I am a good fit to be a software engineer intern at Reddit for summer 2016.
I interned at Microsoft this past summer on the payments team, where I helped the team turn data into insights to diagnose payment issues faster.
In my free time (when I'm not on Reddit) I built RapStock.io, which I grew to 2,000 users. 1,400 of those 2,000 users came from Reddit when we went viral, so I have a soft spot for the community and product.
Let me know what next steps I should take.
Auren,
I'm super interested in the COS role at SafeGraph. I applied on AngelList but figured I'd also shoot you an email.
Thanks,
Nipun Singh
www.nipunsingh.com
Now that you've finally built a kick-ass resume, compiled an impressive project portfolio, and intrigued the HR department at your dream company enough to call you in for an interview based on your strategically written emails, you're ready to ace the technical data science interview questions and land the job. But there's one more piece to the puzzle whose importance is usually underestimated: the behavioral interview. While it's true that 90% of the reason candidates pass interviews for the most coveted big tech and finance jobs is because of their technical skills — their ability to code on the spot, write SQL queries, and answer conceptual statistics questions — neglecting the other 10%, which stems from the behavioral interview, can be a huge mistake.
At my last company, SafeGraph, the company values were even posted in the bathroom. While you're pissing, the company values are in your face. No joke. It's that paramount. So imagine if you came to an interview with me and didn't share any stories that exhibited SafeGraph's company values. You'd have a pretty piss-poor chance of passing the interview!
even looking for?" Behavioral questions have to do with...well...behavior. There are three basic kinds
of things an employer tests for:
Soft skills: How well do you communicate? So much of effective data science is dealing with
stakeholders — would you be able to articulate your proposals to them, or sell your ideas
convincingly enough to get buy-in from them? How well do you work with others? Data science
is a team sport, after all! How do you deal with setbacks and failures? Do you get defensive, or
exhibit a growth mindset?
Position fit: How interested are you in the job and team you're gunning for? What motivates
you about the position — only the paycheck or passion as well?
Culture fit: How well do you fit the team and company’s culture? Can you get behind the
company's mission and values? Basically the corporate version of a "vibe check"!
Essentially, while technical interviews are about whether you can do the job, behavioral interviews
are about whether you want to do the job, and if you are someone others will want to work with.
Fortunately, you can have the charisma of Sheldon Cooper from Big Bang Theory and still pass
behavioral interviews — if you prepare for the most common behavioral questions asked in data
science interviews.
The #1 question after "tell me about yourself" is: "Tell me about a situation where" something happened. Note that this question can be phrased in various ways: "Give me an example of when you did X" and "Describe a situation when you did Y." This is your time to share war stories from past jobs. If you
lack work experience, this is the time for your independent portfolio projects to shine.
A superSTAR Answer
The trick to answering the behavioral questions we listed earlier on the spot is... well...to NOT
answer them on the spot! A lot of preparation needs to go into this so you can give effortless off-the-
cuff answers come interview time. Your first step in preparing flawless answers is to prepare stories
that address the questions we mentioned earlier. But don't prepare factual answers.
Prepare stories.
"But I'm no storyteller, I'm a data scientist! How am I supposed to 'weave a fascinating tale' about something as mundane as work history?"
Luckily, there is a simple formula you can use as a framework to structure your story. It's easy to
remember, too. Just remember that a great story will make you a STAR, so you have to use the STAR
formula:
Situation — Describe a specific challenge (problem or opportunity) you or your team, your
company, or your customers encountered.
Task — Describe the goal you needed to accomplish (the project or task).
Action — Describe your role in the project or task using first person (not what your team did, but
what you did).
Result — Describe what happened as a result of your actions. What did you learn or accomplish?
Keep in mind that not all outcomes need be positive; if things didn't go your way, explain what
lesson you learned (for example, "I learned about the importance of transparency and clear
communication"). Showing that you can handle failure and learn from it is a great trait!
Write your stories out using the STAR formula. Where possible, weave into your narrative key
phrases from the job description and the company culture or values page, so that you hit the
position fit and culture fit elements of the interview.
This is an effective answer because it emphasizes how my data-driven work impacted the product
roadmap — essentially what this Amazon product analytics job is all about. It also demonstrates my
passion for new users, which jibes with Amazon's company value of customer obsession.
Remember, though, a winning answer to a behavioral interview question is about more than just
words. Project the confidence of a college sophomore who thinks majoring in business means they'll be a CEO one day. Embody BDE — big data energy. To dial in your delivery, practice telling your stories out loud. Do this in front of a mirror; it'll force you to pay attention to your nonverbal skills, which are also very important in an interview. Use a timer, and without rushing, ensure your answers are under two minutes long.
Tempted to ask about salary or benefits at this stage? Pass on this! You can ask about salary once you've got the job in the bag. At that point, you are in a far better position to discuss compensation.
As we mention later in Chapter 10: Product Sense, this is the time to leverage the company and
product research you did. You'll gain much more by asking questions that convey your interest in the
company and what they do. From your readings and product analysis, surely you must be curious
about some internal detail, design decision, or what's coming next for some product. This point in
the interview is your opportunity not only to have your intellectual curiosity fulfilled, but to impress
them with your research and genuine interest in the company and its business.
Another idea is to check out details about your interviewer on LinkedIn. It's not uncommon to know
who you'll be interviewing with. Asking a personal question is a sure way to get the interviewer
talking about themself. And people love to do that! If you can tailor questions to their background or
projects they've worked on, great! If not, you can ask these sure-fire conversation starters:
How did you come into this role or company?
What's the most interesting project you've worked on?
What do you think is the most exciting opportunity for the company or product?
In your opinion, what are the top three challenges facing the business?
What do you think is the hardest part of this role?
How do you see the company values in action during your day-to-day work?
Going against the grain from traditional career advice, we think asking questions about the role isn't
the most beneficial use of this opportunity. Sure, you're not going to get into trouble for asking
about the growth trajectory for the role at hand, or what success looks like for the position. It's just
that you'll have ample time, and it's a better use of your time to ask these questions after you have
the job. While you're in interview mode, again, it's important to either reinforce your interest in
the company, their mission and values, or at least have the interviewer talk about themself. We
believe discussing nuances about the role isn't the most productive step to take without an offer at
hand.
Post-interview Etiquette
Whew! Your interview is finally over!
No, it's not!
Send a follow-up thank you note via email a few hours after your interview to keep your name and
abilities fresh in their mind. Plus it shows them your interest in the position is deep and sincere.
Ideally, you'll mention a few of the specific things you connected with them over during the
interview in your email/note. This will help jog their memory as to which interviewee you were and
hopefully bring that connection to mind when they see your name again.
If this book helps you land your dream data job, make a LinkedIn post about it and tag us both, sharing what's resonated with you so far. We'll both connect with you as well as like and comment on the post. You'll get more LinkedIn profile views, followers, and brownie points from us this way.
One of the most crucial skills a data scientist needs to have is the ability to think
probabilistically. Although probability is a broad field and ranges from theoretical concepts
such as measure theory to more practical applications involving various probability
distributions, a strong foundation in the core concepts of probability is essential.
In interviews, probability's foundational concepts are heavily tested, particularly conditional probability and basic applications involving the PDFs of various probability distributions. In the finance industry, interview questions on probability, including expected values and betting decisions, are especially common. More in-depth problems that build off of these foundational probability topics are common in statistics interview problems, which we cover in the next chapter. For now, we'll start with the basics of probability.
Basics
Conditional Probability
We are often interested in knowing the probability of an event A given that an event B has occurred.
For example, what is the probability of a patient having a particular disease, given that the patient
tested positive for the disease? This is known as the conditional probability of A given B and is often
found in the following form based on Bayes' rule:
P(A|B) = P(B|A) · P(A) / P(B)
Under Bayes' rule, P(A) is known as the prior, P(B|A) as the likelihood, and P(A|B) as the posterior.
If this conditional probability is presented simply as P(A)—that is, if P(A|B) = P(A)—then A and B are independent, since knowing about B tells us nothing about the probability of A having also occurred. Similarly, it is possible for A and B to be conditionally independent given the occurrence of another event C:

P(A ∩ B | C) = P(A|C) · P(B|C)
The statement above says that, given that C has occurred, knowing that B has also occurred tells us
nothing about the probability of A having occurred.
If other information is available and you are asked to calculate a probability, you should always consider using Bayes' rule. It is an incredibly common interview topic, so understanding its underlying concepts and the real-life applications involving it will be extremely helpful. For example, in medical testing for rare diseases, Bayes' rule is especially important, since it may be misleading to simply diagnose someone as having a disease—even if the test for the disease is considered "very accurate"—without accounting for the disease's low base rate (its prior probability).
Bayes' rule also plays a crucial part in machine learning, where, frequently, the goal is to identify the
best conditional distribution for a variable given the data that is available. In an interview, hints will
often be given that you need to consider Bayes' rule. One such strong hint is an interviewer's
wording in directions to find the probability of some event having occurred "given that" another
event has already occurred.
Counting
The concept of counting typically shows up in one form or another in most interviews. Some
questions may directly ask about counting (e.g., "How many ways can five people sit around a lunch
table?"), while others may ask a similar question, but as a probability (e.g., "What is the likelihood
that you draw four cards of the same suit?").
Two forms of counting elements are generally relevant. If the order of selection of the n items being
counted k at a time matters, then the method for counting possible permutations is employed:
n · (n − 1) · … · (n − k + 1) = n! / (n − k)!
In contrast, if order of selection does not matter, then the technique to count possible number of
combinations is relevant:
C(n, k) = n! / (k! (n − k)!)
Knowing these concepts is necessary in order to assess various probabilities that involve counting procedures. Therefore, remember to determine when the order of selection does versus does not matter.
For some real-life applications of both, consider making up passwords (where the order of characters matters) versus choosing restaurants nearby on a map (where order does not matter, only the
options). Lastly, both permutations and combinations are frequently encountered in combinatorial
and graph theory-related questions.
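As a quick sanity check on both counting rules, here is a short Python sketch using the standard library's math.perm and math.comb; the card question mirrors the example mentioned above.

    import math

    # Permutations: ordered selections of k items from n (order matters)
    print(math.perm(10, 3))   # 10 * 9 * 8 = 720

    # Combinations: unordered selections of k items from n (order doesn't matter)
    print(math.comb(52, 4))   # 270725 possible 4-card hands

    # P(all four cards drawn are the same suit) = 4 * C(13, 4) / C(52, 4)
    p_same_suit = 4 * math.comb(13, 4) / math.comb(52, 4)
    print(p_same_suit)        # ~0.0106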
Random Variables
Random variables are a core topic within probability, and interviewers generally verify that you
understand the principles underlying them and have a basic ability to manipulate them. While it is
not necessary to memorize all the mechanics associated with them or specific use cases, knowing the concepts and their applications is highly recommended.
A random variable is a quantity with an associated probability distribution. It can be either discrete
(i.e., have a countable range) or continuous (have an uncountable range). The probability distribution
associated with a discrete random variable is a probability mass function (PMF), and that associated
with a continuous random variable is a probability density function (PDF). Both can be represented
by the following function of x : f x ( x)
In the discrete case, X can take on particular values with a particular probability, whereas, in the
continuous case, the probability of a particular value of x is not measurable; instead, a "probability
mass"-per unit per length around x can be measured (imagine the small interval of x and x +δ ).
Probabilities of both discrete and continuous random variables must be non-negative and must sum
(in the discrete case) or integrate (in the continuous case) to 1:
Discrete: Σ_{x ∈ X} f_X(x) = 1,   Continuous: ∫_{−∞}^{∞} f_X(x) dx = 1
The cumulative distribution function (CDF) is often used in practice rather than a variable's PMF or
PDF and is defined as follows in both cases: F_X(x) = P(X ≤ x).
For a discrete random variable, the CDF is given by a sum, F_X(x) = Σ_{k ≤ x} p(k), whereas, for a continuous random variable, it is given by an integral, F_X(x) = ∫_{−∞}^{x} f_X(t) dt.
Thus, the CDF, which is non-negative and monotonically increasing, can be obtained by taking the
sums of PMFs for discrete random variables, and the integral of PDFs for continuous random
variables.
Knowing the basics of PDFs and CDFs is very useful for deriving properties of random variables, so
understanding them is important. Whenever asked about evaluating a random variable, it is
essential to identify both the appropriate PDF and CDF at hand.
For two jointly distributed random variables X and Y, the joint PDF f_{X,Y}(x, y) must likewise integrate to 1:

∫_{−∞}^{∞} ∫_{−∞}^{∞} f_{X,Y}(x, y) dx dy = 1
This is useful, since it allows for the calculation of probabilities of events involving X and Y.
From a joint PDF, a marginal PDF can be derived. Here, we derive the marginal PDF for X by
integrating out the Y term:
f_X(x) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dy
It is also possible to condition PDFs and CDFs on other variables. For example, for random variables X
and Y, which are assumed to be jointly distributed, we have the following conditional probability:
f_X(x) = ∫_{−∞}^{∞} f_Y(y) f_{X|Y}(x | y) dy
where X is conditioned on Y. This is an extension of Bayes' rule and works in both the discrete and
continuous case, although in the former, summation replaces integration.
Generally, these topics are asked only in very technical rounds, although a basic understanding helps
with respect to general derivations of properties. When asked about more than one random
variable, make it a point to think in terms of joint distributions.
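As an illustration of working with a PDF and a CDF programmatically, here is a minimal sketch using scipy.stats (the standard normal is just an example choice of distribution):

from scipy import stats

X = stats.norm(loc=0, scale=1)        # a standard normal random variable

print(X.pdf(0.0))                     # density at x = 0 (about 0.3989)
print(X.cdf(1.96))                    # P(X <= 1.96) (about 0.975)
print(X.cdf(1.0) - X.cdf(-1.0))       # P(-1 <= X <= 1) via the CDF (about 0.683)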
Probability Distributions
There are many probability distributions, and interviewers generally do not test whether you have
memorized specific properties on each (although it is helpful to know the basics), but, rather, to see
if you can properly apply them to specific situations. For example, a basic use case would be to
assess the probability that a certain event occurs when using a particular distribution, in which case
you would directly utilize the distribution's PDF. Below are some overviews of the distributions most
commonly included in interviews.
The binomial distribution gives the probability of observing k successes in n independent trials, each of which succeeds with probability p. Its PMF is

P(X = k) = (n choose k) p^k (1 − p)^(n−k)

and its mean and variance are: μ = np, σ² = np(1 − p).
The most common applications for a binomial distribution are coin flips (the number of heads in n
flips), user signups, and any situation involving counting some number of successful events where
the outcome of each event is binary.
The Poisson distribution gives the probability of the number of events occurring within a particular
fixed interval, where the known, constant rate of occurrence is λ. The Poisson distribution's PMF is

P(X = k) = λ^k e^(−λ) / k!

and its mean and variance are: μ = λ, σ² = λ.
The most common applications for a Poisson distribution are in assessing counts over a continuous
interval, such as the number of visits to a website in a certain period of time or the number of defects
in a square foot of fabric. Thus, instead of coin flips with probability p of a head as a use case of the
binomial distribution, applications of the Poisson will involve a process X occurring at a rate λ.
The normal distribution, parameterized by its mean μ and variance σ², has the PDF

f(x) = (1 / √(2πσ²)) exp(−(x − μ)² / (2σ²))

and its mean and variance are given by μ and σ², respectively.
Many applications involve the normal distribution, largely due to (a) its natural fit to many real-life
occurrences, and (b) the Central Limit Theorem (CLT). Therefore, it is very important to remember
the normal distribution's PDF.
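A brief sketch of how these three distributions can be evaluated in practice with scipy.stats (the parameter values below are arbitrary examples, not taken from any particular interview question):

from scipy import stats

# Binomial: probability of exactly 3 heads in 10 fair coin flips
print(stats.binom.pmf(k=3, n=10, p=0.5))

# Poisson: probability of exactly 2 website visits in an interval with rate lambda = 5
print(stats.poisson.pmf(k=2, mu=5))

# Normal: density at x = 1 for mean 0 and standard deviation 2
print(stats.norm.pdf(x=1, loc=0, scale=2))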
Markov Chains
A Markov chain is a process in which there is a finite set of states, and the probability of being in a
particular state is only dependent on the previous state. Stated another way, the Markov property is
such that, given the current state, the past and future states it will occupy are conditionally
independent.
The probability of transitioning from state i to state j at any given time is given by a transition matrix,
denoted by P:
P = [ P_11 … P_1n ]
    [  ⋮    ⋱   ⋮  ]
    [ P_m1 … P_mn ]
Various characterizations are used to describe states. A recurrent state is one whereby, if entering
that state, one will always transition back into that state eventually. In contrast, a transient state is
one in which, if entered, there is a positive probability that upon leaving, one will never enter that
state again.
A stationary distribution π for a Markov chain satisfies the following characteristic: π = πP, where P is
the transition matrix; that is, π remains fixed following any transition using P. Thus, π contains the
long-run proportions of time that the process will spend in each particular state.
Usual questions asked on this topic involve setting up various problems as Markov chains and
answering basic properties concerning Markov chain behavior. For example, you might be asked to
model the states of users (new, active, or churned) for a product using a transition matrix and then
be asked questions about the chain's long-term behavior. It is generally a good idea to think of
Markov chains when multiple states are to be modeled (with transitions between them) or when
questioned concerning the long-term behavior of some system.
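As a sketch of the user-state example above (the three states and the transition probabilities are invented for illustration), a stationary distribution can be found numerically by repeatedly applying the transition matrix:

import numpy as np

# Hypothetical transition matrix over the states (new, active, churned); each row sums to 1
P = np.array([[0.0, 0.7, 0.3],
              [0.0, 0.8, 0.2],
              [0.0, 0.1, 0.9]])

pi = np.array([1.0, 0.0, 0.0])   # start everyone in the "new" state
for _ in range(1000):            # power iteration: pi converges to a vector with pi = pi @ P
    pi = pi @ P

print(pi)                        # long-run proportion of time spent in each state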
5.11. Morgan Stanley: You and your friend are playing a game. The two of you will continue to toss a
coin until the sequence HH or TH shows up. If HH shows up first, you win. If TH shows up first,
your friend wins. What is the probability of you winning?
5.12. JP Morgan: Say you are playing a game where you roll a 6-sided die up to two times and can
choose to stop following the first roll if you wish. You will receive a dollar amount equal to the
final amount rolled. How much are you willing to pay to play this game?
5.13. Facebook: Facebook has a content team that labels pieces of content on the platform as either
spam or not spam. 90% of them are diligent raters and will mark 20% of the content as spam
and 80% as non-spam. The remaining 10% are not diligent raters and will mark 0% of the
content as spam and 100% as non-spam. Assume the pieces of content are labeled
independently of one another, for every rater. Given that a rater has labeled four pieces of
content as good, what is the probability that this rater is a diligent rater?
5.14. D.E. Shaw: A couple has two children. You discover that one of their children is a boy. What is
the probability that the second child is also a boy?
5.15. JP Morgan: A desk has eight drawers. There is a probability of 1/2 that someone placed a
letter in one of the desk's eight drawers and a probability of 1/2 that this person did not place
a letter in any of the desk's eight drawers. You open the first 7 drawers and find that they are
all empty. What is the probability that the 8th drawer has a letter in it?
5.16. Optiver: Two players are playing in a tennis match, and are at deuce (that is, they will play
back and forth until one person has scored two more points than the other). The first player
has a 60% chance of winning every point, and the second player has a 40% chance of winning
every point. What is the probability that the first player wins the match?
5.17. Facebook: Say you have a deck of 50 cards made up of cards in 5 different colors, with 10 cards
of each color, numbered 1 through 10. What is the probability that two cards you pick at
random do not have the same color and are also not the same number?
5.18. SIG: Suppose you have ten fair dice. If you randomly throw these dice simultaneously, what is
the probability that the sum of all the top faces is divisible by 6?
Medium
5.19. Morgan Stanley: A and B play the following game: a number k from 1-6 is chosen, and A and
B will toss a die until the first person throws a die showing side k , after which that person is
awarded $100 and the game is over. How much is A willing to pay to play first in this game?
5.20. Airbnb: You are given an unfair coin having an unknown bias towards heads or tails. How can
you generate fair odds using this coin?
5.21. SIG: Suppose you are given a white cube that is broken into 3 x 3 x 3 = 27 pieces. However,
before the cube was broken, all 6 of its faces were painted green. You randomly pick a small
cube and see that 5 faces are white. What is the probability that the bottom face is also
white?
5.22. Goldman Sachs: Assume you take a stick of length 1 and you break it uniformly at random
into three parts. What is the probability that the three pieces can be used to form a triangle?
5.23. Lyft: What is the probability that, in a random sequence of H's and T's, HHT shows up before
HTT?
5.24. Uber: A fair coin is tossed twice, and you are asked to decide whether it is more likely that
two heads showed up given that either (a) at least one toss was heads, or (b) the second toss
was a head. Does your answer change if you are told that the coin is unfair?
5.25. Facebook: Three ants are sitting at the corners of an equilateral triangle. Each ant randomly
picks a direction and begins moving along an edge of the triangle. What is the probability that
none of the ants meet? What would your answer be if there are, instead, k ants sitting on all
k corners of an equilateral polygon?
5.26. Robinhood: A biased coin, with probability p of landing on heads, is tossed n times. Write a
recurrence relation for the probability that the total number of heads after n tosses is even.
5.27. Citadel: Alice and Bob are playing a game together. They play a series of rounds until one of
them wins two more rounds than the other. Alice wins a round with probability p. What is the
probability that Bob wins the overall series?
5.28. Google: Say you have three draws of a uniformly distributed random variable between (0, 2).
What is the probability that the median of the three is greater than 1.5?
Hard
5.29. D.E. Shaw: Say you have 150 friends, and 3 of them have phone numbers that have the last
four digits with some permutation of the digits 0, 1, 4, and 9. What's the probability of this
occurring?
5.30. Spotify: A fair die is rolled n times. What is the probability that the largest number rolled is r,
for each r in 1, ..., 6?
5.31. Goldman Sachs: Say you have a jar initially containing a single amoeba in it. Once every
minute, the amoeba has a 1 in 4 chance of doing one of four things: (1) dying out, (2) doing
nothing, (3) splitting into two amoebas, or (4) splitting into three amoebas. What is the
probability that the jar will eventually contain no living amoeba?
5.32. Lyft: A fair coin is tossed n times. Given that there were k heads in the n tosses, what is the
probability that the first toss was heads?
5.33. Quora: You have N i.i.d. draws of numbers following a normal distribution with parameters µ
and σ. What is the probability that k of those draws are larger than some value Y?
5.34. Akuna Capital: You pick three random points on a unit circle and form a triangle from them.
What is the probability that the triangle includes the center of the unit circle?
5.35. Citadel: You have r red balls and w white balls in a bag. You continue to draw balls from the
bag until the bag only contains balls of one color. What is the probability that you run out of
white balls first?
(6 choose 3) / 2^6 = 20/64 = 5/16
where the numerator is the number of ways of splitting up 3 games won by either side, and the
denominator is the total number of possible outcomes of 6 games.
Solution #5.2
Note that there are only two ways for 6s to be consecutive: either the pair happens on rolls 1 and 2
or 2 and 3, or else all three are 6s. In the first case, the probability is given by
2 · (5/6) · (1/6)² = 10/216
and, for all three, the probability is
(1/6)³ = 1/216
The desired probability is given by: 10/216 + 1/216 = 11/216
Solution #5.3
First, note that the three rolls must all yield different numbers; otherwise, no strictly increasing order
is possible. The probability that the three numbers will be different is given by the following
reasoning. The first number can be any value from 1 through 6, the second number has a 5/6 chance
of not being the same number as the first, and the third number has a 4/6 chance of not being the
prior two numbers. Thus,
1 · (5/6) · (4/6) = 5/9
Conditioned on there being three different numbers, there is exactly one particular sequence that
will be in a strictly increasing order, and this sequence occurs with probability 1/3! = 1/6
Therefore, the desired probability is given by: (5/9) · (1/6) = 5/54
Solution #5.4
Note that there are a total of (100 choose 2) = 4950
ways to choose two cards at random from the 100. There are exactly 50 pairs that satisfy the
condition: (1, 2),…,(50, 100). Therefore, the desired probability is:
50 / 4950 ≈ 0.01
Solution #5.5
Note that getting to (3, 3, 3) requires 9 moves. Using these 9 moves, it must be the case that there
are exactly three moves in each of the three directions (up, right, and forward). There are 9! ways to
order the 9 moves overall, but we must divide by 3! for each direction to avoid overcounting, since
the three moves in any given direction are indistinguishable from one another. Therefore, the number of paths is:
9! / (3! 3! 3!) = 1680
Solution #5.6
Let A denote the event that someone has the disease, and B denote the event that this person tests
positive for the disease. Then we want P(A|B).
By applying Bayes' theorem, we obtain: P(A|B) = P(B|A) P(A) / P(B)
From the problem description, we know that P(B|A) = 0.98 and P(A) = 0.001.
Let A' denote the event that someone does not have the disease. Then, we know that P(B|A') = 0.01.
For the denominator, we have:
P(B) = P(B|A)P(A) + P(B|A')P(A') = 0.98(0.001) + 0.01(0.999)
Therefore, after combining terms, we have the following:
P(A|B) = [0.98 × 0.001] / [0.98(0.001) + 0.01(0.999)] ≈ 8.93%
Solution #5.7
We can use Bayes' theorem here. Let U denote the case where we are flipping the unfair coin and F
denote the case where we are flipping a fair coin. Since the coin is chosen randomly, we know that
P(U) = P(F) = 0.5. Let 5T denote the event of flipping 5 tails in a row. Then, we are interested in
solving for P(U | 5T), i.e., the probability that we are flipping the unfair coin, given that we obtained 5
tails in a row.
We know P(5T | U) = 1, since, by definition, the unfair coin always results in tails. Additionally, we
know that P(5T | F) = 1/2⁵ = 1/32 by definition of a fair coin. By Bayes' theorem, we have:

P(U | 5T) = P(5T | U) P(U) / [P(5T | U) P(U) + P(5T | F) P(F)] = 0.5 / (0.5 + 0.5 · 1/32) ≈ 0.97
Therefore, the probability we picked the unfair coin is about 97%.
Solution #5.8
Let P(A) be the probability that A wins. Then, we know the following to be true:
1. If A flips heads initially, A wins with probability 1.
2. If A flips tails initially, and then B flips a tail, then it is as if neither flip had occurred, and so A wins
with probability P(A).
Combining the two outcomes, we have: P(A) = p + (1 − p)²P(A). Expanding this yields
P(A) = p + P(A) − 2pP(A) + p²P(A), so that p²P(A) − 2pP(A) + p = 0,
and hence: P(A) = 1 / (2 − p)
Solution #5.9
Let R denote the event that it is raining, and Y be a "yes" response when you ask a friend if it is
raining. Then, from Bayes' theorem, we have the following:
P(R | YYY) = P(YYY | R) P(R) / P(YYY)

P(YYY | R) P(R) = (2/3)³ · (1/4) = 2/27

Let R' denote the event of no rain; then the denominator is given by the following:

P(YYY) = P(YYY | R) P(R) + P(YYY | R') P(R') = (2/3)³ (1/4) + (1/3)³ (3/4)

which, when simplified, yields: P(YYY) = 11/108

Combining terms, we obtain the desired probability: P(R | YYY) = (2/27) / (11/108) = 8/11
Solution #5.10
By definition, a chord is a line segment whose two endpoints lie on the circle. Therefore, two
arbitrary chords can always be represented by any four points chosen on the circle. If you choose to
represent the first chord by two of the four points, then you have:

(4 choose 2) = 6

choices of the two points to represent chord 1 (and, hence, the other two will represent chord 2).
However, note that this counting counts each configuration twice, since selecting points {A, B} for
chord 1 and {C, D} for chord 2 produces the same pair of chords as selecting {C, D} for chord 1 and
{A, B} for chord 2. Therefore, the proper number of distinct configurations is:

(1/2) (4 choose 2) = 3

Among these three configurations, only one results in chords that intersect; hence, the desired
probability is:

p = 1/3
Solution #5.11
Although there is a formal way to apply Markov chains to this problem, there is a simple trick that
simplifies the problem greatly. Note that, if T is ever flipped, you cannot then reach HH before your
friend reaches TH, since the first heads thereafter will result in them winning. Therefore, the
probability of you winning is limited to just flipping an HH initially, which we know is given by the
following probability:
P(HH) = (1/2) · (1/2) = 1/4
Therefore, you have a 1/4 chance of winning, whereas your friend has a 3/4 chance.
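If you want to double-check an answer like this during practice, a quick Monte Carlo simulation is an easy sanity check (a rough sketch; the simulation is not part of the expected interview answer):

import random

def you_win():
    # Flip until HH or TH appears; return True if HH (your sequence) appears first
    prev = random.choice("HT")
    while True:
        cur = random.choice("HT")
        if prev == "H" and cur == "H":
            return True
        if prev == "T" and cur == "H":
            return False
        prev = cur

trials = 100_000
print(sum(you_win() for _ in range(trials)) / trials)   # approximately 0.25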
Solution #5.12
The price you would be willing to pay is equal to the expectation of the final amount. Note that, for
the first roll, the expectation is
∑_{i=1}^{6} i/6 = 21/6 = 3.5
Therefore, there are two events on which you need to condition. The first is getting a 1, 2, or 3 on
the first roll, in which case you would roll again (since a new roll has an expectation of 3.5, which is
higher than any of those values), so that, overall, you have an expectation of 3.5. The second is
rolling a 4, 5, or 6 on the first roll, in which case you would keep that roll and end the game, and the
overall expectation would be 5, the average of 4, 5, and 6. Therefore, the expected payoff of the overall game is
(1/2) · 3.5 + (1/2) · 5 = 4.25
Therefore, you would be willing to pay up to $4.25 to play.
Solution #5.13
Let D denote the case where a rater is diligent, and E the case where a rater is non-diligent. Further,
let 4N denote the case where four pieces of content are labeled as non-spam. We want to solve for
P(D | 4N), and can use Bayes' theorem as follows to do so:

P(D | 4N) = P(4N | D) P(D) / [P(4N | D) P(D) + P(4N | E) P(E)]

We are given that P(D) = 0.9 and P(E) = 0.1. Also, we know that P(4N | D) = 0.8 × 0.8 × 0.8 × 0.8 = 0.8⁴ due to
the independence of each of the 4 labels assigned by a diligent rater. Similarly, we know that P(4N | E)
= 1, since a non-diligent rater always labels content as non-spam. Substituting into the equation
above yields the following:

P(D | 4N) = (0.8⁴ × 0.9) / (0.8⁴ × 0.9 + 1 × 0.1) ≈ 0.79
Solution #5.14
This is a tricky problem, because your mind probably jumps to the answer of 1/2 because knowing
the gender of one child shouldn't affect the gender of the other. However, the phrase "the second
child is also a boy" implies that we want to know the probability that both children are boys given
that one is a boy. Let B represent a boy and G represent a girl. We then have the following total
sample space representing the possible genders of 2 children: BB, BG, GB, GG.
However, since one child was said to be a boy, the valid sample space is reduced to the following:
BB, BG, GB.
Since all of these options are equally likely, the answer is simply 1/3.
Solution #5.15
Let A denote the event that there is a letter in the 8th drawer, and B denote the event that the first 7
drawers are all empty.
The probability of B occurring can be found by conditioning on whether a letter was put in the
drawers or not; if so, then each drawer is equally likely to contain a letter, and if not, then none
contain the letter. Therefore, we have the following:
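Completing the conditioning set up above (a sketch using only the quantities already defined):

P(B) = P(B | letter placed) P(letter placed) + P(B | no letter) P(no letter) = (1/8)(1/2) + (1)(1/2) = 9/16

P(A ∩ B) = P(a letter was placed and it is in the 8th drawer) = (1/2)(1/8) = 1/16

P(A | B) = P(A ∩ B) / P(B) = (1/16) / (9/16) = 1/9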
Solution #5.16
We can use a recursive formulation. Let p be the probability that the first player wins. Assume the
score is 0-0 (on a relative basis).
If the first player wins the first point (with probability 0.6), then two outcomes are possible: with
probability 0.6 the first player wins the next point as well and therefore the match, and with
probability 0.4 the score is back to 0-0, with p again being the probability of the first player winning overall.
Similarly, if the first player loses the first point (with probability 0.4), then with probability 0.6 the score is
back to 0-0 (with p being the probability of the first player winning), or, with probability 0.4, the first
player loses the match. Therefore, we have: p = 0.6² + 2(0.6)(0.4)p
Solving this yields: p ≈ 0.692
The key idea to solving this and similar problems is that, after two points, either the game is over, or
we're back where we started. We don't need to ever consider the third, fourth, etc., points in an
independent way.
Solution #5.17
The first card will always be a unique color and number, so let's consider the second card. Let A be
the event that the color of card 2 does not match that of card 1, and let B be the event that the
number of card 2 does not match that of card 1. Then, we want to find the following:
P ( A ∩B )
Note that the second card cannot match the first card in both color and number simultaneously,
since each (color, number) combination appears on exactly one card. Hence, P(A ∩ B) = P(A) P(B | A).
For A to occur, there are 40 remaining cards of a color different from that of the first card drawn (and
49 remaining cards altogether). Therefore,

P(A) = 40/49

For B, we know that, of those 40 remaining cards, 36 of them (9 in each of the other colors) do not
have the same number as that of card 1. Therefore,

P(B | A) = 36/40

Thus, the desired probability is: P(A ∩ B) = (40/49) · (36/40) = 36/49
Solution #5.18
Consider the first nine dice. The sum of those nine dice will be either 0, 1, 2, 3, 4, or 5 modulo 6.
Regardless of that sum, exactly one value for the tenth die will make the sum of all 10 divisible by 6.
For instance, if the sum of the first nine dice is 1 modulo 6, the sum of the first 10 will be divisible by
6 only when the tenth die shows a 5. Thus, the probability is 1/6 for any number of dice, and,
therefore, the answer is simply 1/6.
Solution #5.19
To assess the amount A is willing to pay, we need to calculate the expected probabilities of winning
for each player, assuming A goes first. Let the probability of A winning (if A goes first) be given by
P(A), and the probability of B winning (if A goes first but doesn't win on the first roll) be P(B’).
Then we can use the following recursive formulation: P(A) = 1/6 + (5/6)(1 − P(B'))
since A wins immediately with a 1/6 chance (the first roll shows side k), and otherwise (with a 5/6
chance, when the first roll is not k), A wins only if B, who now effectively rolls first, does not win.
However, notice that, if A doesn't roll side k immediately, then P(B') = P(A), since the game is now
exactly symmetric with player B going first.
Therefore, the above can be rewritten as follows: P(A) = 1/6 + 5/6 − (5/6)P(A)
Solving yields P(A) = 6/11, and P(B) = 1 − P(A) = 5/11. Since the payout is $100, A should be willing
to pay an amount up to the difference in expected values from going first, which is 100 × (6/11 − 5/11) =
100/11, or about $9.09.
Solution #5.20
Let P(H) be the probability of landing on heads, and P(T) be the probability of landing tails for any
given flip, where P(H) + P(T) = 1. Note that it is impossible to generate fair odds using only one flip. If
we use two flips, however, we have four outcomes: HH, HT, TH, and TT. Of these four outcomes, note
that two (HT and TH) have equal probabilities, since P(H) × P(T) = P(T) × P(H). We can disregard HH and TT
(flipping a fresh pair whenever either occurs) and evaluate only complete pairs of flips; e.g., the sequence HHT is not read as containing an HT.
Therefore, it is possible to generate fair odds by flipping the unfair coin twice and assigning heads to
the HT outcome on the unfair coin, and tails to the TH outcome on the unfair coin.
Solution #5.21
The only possible candidates for the cube you selected are the following: either it is the inside center
piece (in which case all faces are white) or a middle face piece (where 5 faces are white, and one face is
green). The former can be placed in six different ways, and the latter can only be placed in one
particular way. Since all cubes are chosen equally randomly, let A be the event that the bottom face
of the cube picked is white, and B be the event that the other five faces are white.
Note that there is a 1/27 chance that the piece is the center piece and a 6/27 chance that the piece
is the middle piece. Therefore, the probability of B happening is given by the following:
P(B) = (1/27)(1) + (6/27)(1/6) = 2/27

Then, using Bayes' rule:

P(A | B) = P(A ∩ B) / P(B) = (1/27) / [(1/27) + (6/27)(1/6)] = (1/27) / (2/27) = 1/2
Solution #5.22
Assume that the stick looks like the following, with cut points at X and Y
----------X-----|-----Y-----------
Let M (shown as | above) denote the stick's midpoint at 0.5 of the stick's 1-unit length. Note that, if X
and Y fall on the same side of the midpoint, either on its left or its right, then no triangle is possible,
because, in that case, the length of one of the pieces would be greater than 1/2 (and thus we would
have two sides having a total length strictly less than that of the longest side, making forming a
triangle impossible). The probability that X and Y are on the same side (since the breaks are assumed
to be chosen randomly) is simply 1/2.
Now, assume that X and Y fall on different sides of the midpoint. If X is further to the left in its half
than Y is in its half, then no triangle is possible, since the part lying between X and Y would then
have a length strictly greater than 0.5 (for example, X at 0.2 and Y at 0.75). This has a 1/2
chance of occurring by a simple symmetry argument, but it is conditional on X and Y being on
different sides of the midpoint, an outcome which itself has a 1/2 chance of occurring. Therefore,
this case occurs with probability 1/4. These two cases represent all cases in which no valid triangle can
be formed; thus, it follows that the probability of a valid triangle being formed equals 1 − 1/2 − 1/4 =
1/4.
Solution #5.23
Note that both sequences require a heads first, and any sequence of just tails prior to that is
irrelevant to either showing up. Once the first H appears, there are three possibilities. If the next flip
is an H, HHT will inevitably appear first, since the next T will complete that sequence. This has
probability 1/2.
If the next flip is a T, there are two possibilities. If TT appears, then HTT appeared first. This has
probability 1/4. Alternatively, if TH appears, we are back in the initial configuration of having gotten
the first H. Thus, we have:
p = 1/2 + (1/4)p

Solving yields p = 2/3
Solution #5.24
Let A be the event that the first toss is a heads and B be the event that the second toss is a heads.
Then, for the first case, we are assessing P(A ∩ B | A ∪ B), whereas, for the second case, we are
assessing P(A ∩ B | B).

For the first case, we have: P(A ∩ B | A ∪ B) = P((A ∩ B) ∩ (A ∪ B)) / P(A ∪ B) = P(A ∩ B) / P(A ∪ B) = (1/4) / (3/4) = 1/3

And, for the second case, we have: P(A ∩ B | B) = P(A ∩ B) / P(B) = (1/4) / (1/2) = 1/2

Therefore, the second case is more likely. For an unfair coin, the conclusion is unchanged, because it
will always be true that P(A ∪ B) > P(B), so the first case will always be less probable than the second
case.
Solution #5.25
Note that the ants are guaranteed to collide unless they each move in the exact same direction. This
only happens when all the ants move clockwise or all move counter-clockwise (picture the triangle
in 2D). Let P(N) denote the probability of no collision, P(C) denote the case where all ants go
clockwise, and P(D) denote the case where all ants go counterclockwise. Since every ant can choose
either direction with equal probability, then we have:
P(N) = P(C) + P(D) = (1/2)³ + (1/2)³ = 1/4
If we extend this reasoning to k ants, the logic is still the same, so we obtain the following:
P(N) = P(C) + P(D) = (1/2)^k + (1/2)^k = 1/2^(k−1)
Solution #5.26
Let A be the event that the total number of heads after n tosses is even, B be the event that the first
toss was tails, and B' be the event that the first toss was heads. By the law of total probability, we
have the following: P(A) = P(A | B) P(B) + P(A | B') P(B')
Then, letting P_n denote the probability that the number of heads after n tosses is even, we can write the recurrence relation as follows: P_n = (1 − p) P_{n−1} + p(1 − P_{n−1})
Solution #5.27
Note that since Alice can win with probability p, Bob, by definition, can win with probability 1-p.
Denote 1 − p as q for convenience. Let B_i represent the event that Bob wins i of the first two rounds, for i = 0, 1, 2. Let
B* denote the event that Bob wins the entire series. We can use the law of total probability as
follows:
P(B*) = P(B* | B_2) P(B_2) + P(B* | B_1) P(B_1) + P(B* | B_0) P(B_0)
Since Bob wins each round with probability q, we have: P(B_2) = q², P(B_1) = 2pq, P(B_0) = p²
Substituting these values into the above expression yields: P(B*) = 1 · q² + P(B*) · 2pq + 0 · p²
Hence, the desired probability is the following: P(B*) = q² / (1 − 2pq)
Solution #5.28
Because the median of three numbers is the middle number, the median is at least 1.5 if at most one
of the 3 is strictly less than 1.5 (since the other 2 must be strictly greater than 1.5). Since each is
uniformly randomly distributed, then the probability of any one of them being strictly less than 1.5 is
given by the following:
1.5 / 2 = 3/4
Therefore, the chance that at most one is strictly less than 1.5 is given by the sum of the probabilities
of exactly one being strictly less than 1.5 and of none being strictly less than 1.5:

p = (3 choose 1) (3/4) (1/4)² + (1/4)³ = 10/64
Solution #5.29
Let p be the probability that a phone number has the last 4 digits involving only the above given
digits (0, 1, 4 and 9).
We know that the total number of possible last-4-digit combinations is 10^4 = 10,000, since there are
10 digits (0-9). There are 4! ways to pick a 4-digit permutation of 0, 1, 4, and 9.
Therefore, we have: p = 4! / 10,000 = 24/10,000 = 3/1250
Now, since you have 150 friends, the probability of there being exactly 3 with this combination is
given by:
(150 choose 3) p³ (1 − p)^147 ≈ 0.00535
Solution #5.30
Let B_r be the event that all n rolls have a value less than or equal to r. Then we have:

P(B_r) = r^n / 6^n

since all n rolls must have a value less than or equal to r. Let A_r be the event that the largest number
rolled is r. We have B_r = B_{r−1} ∪ A_r, and, since the two events on the right-hand side are disjoint, we have the
following: P(B_r) = P(B_{r−1}) + P(A_r)

Therefore, the probability of A_r is given by: P(A_r) = P(B_r) − P(B_{r−1}) = r^n / 6^n − (r − 1)^n / 6^n
Solution #5.31
Let p be the probability that the amoeba(s) die out. At any given time step, the probability of dying
out eventually must still be p.
For case (1), the single amoeba dies out immediately, so the probability of eventually dying out is 1.
For case (2), we are back where we started, so the probability of dying out is p.
For case (3), there are now two amoebas, and each line must independently die out, which happens with probability p².
For case (4), similarly, each of the three amoebas' lines must die out, which happens with probability p³.
Putting all four together, we note that the probability of the population dying out at t = 0 minutes
must be the same as the probability of the population dying out at t = 1 minutes. Therefore, we
have:
p = (1/4)(1 + p + p² + p³)

and solving this yields: p = √2 − 1
Solution #5.32
Note that there are (n−1 choose k) ways to choose k heads with the first toss being a tails, and a total of
(n choose k) ways to obtain k heads overall. So, the probability of having a tails first is given by:

(n−1 choose k) / (n choose k) = (n − k) / n

and, therefore, the probability of obtaining a heads first is given by the following:

1 − (n − k)/n = k/n
Solution #5.33
Let the N draws be denoted as X₁, X₂, …, X_N. We know that, for any given draw i, the probability that it exceeds Y is:

p = 1 − Φ((Y − μ) / σ)

where Φ is the CDF of the standard normal distribution. Then, the desired probability is given by: (N choose k) p^k (1 − p)^(N−k)
Solution #5.34
Note that, without loss of generality, the first point can be located at angle 0, i.e., at (1, 0). Using the polar
coordinate system, let the two other points be at angles θ and φ, respectively.
Note that the second point can be placed on either half (top or bottom) without loss of generality.
Therefore, assume that it is on the top half, so that 0 < θ < π.
If the third point is also in the top half (0 < φ < π), then the resulting triangle will not contain the center of the
unit circle. It will also not contain the center if π + θ < φ < 2π (try drawing this out).
Therefore, for any given second point at angle θ, the probability that the third point makes the resulting triangle
contain the center of the unit circle is the following:

p = θ / (2π)

Therefore, the overall probability is given by integrating over the possible values of θ, where the
constant in front averages over θ:

(1/π) ∫₀^π θ/(2π) dθ = 1/4
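For geometric probability questions like this one, a quick simulation is a useful way to confirm the analytical answer (a rough sketch, not part of the expected interview response; it relies on the standard fact that the triangle contains the center exactly when no arc between consecutive points exceeds π):

import random, math

def contains_center():
    # Three random points on the unit circle, represented by angles in [0, 2*pi)
    a, b, c = sorted(random.uniform(0, 2 * math.pi) for _ in range(3))
    # The triangle contains the center iff every arc between consecutive points is < pi
    gaps = (b - a, c - b, 2 * math.pi - (c - a))
    return max(gaps) < math.pi

trials = 100_000
print(sum(contains_center() for _ in range(trials)) / trials)   # approximately 0.25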
Solution #5.35
In order to run out of white balls first, all the white balls must be drawn before the r-th red ball is
drawn; equivalently, the last ball drawn must be red. We can consider the draws up to position
w + r − 1 and count the arrangements in which all w white balls appear among them.
The first white ball has w + r − 1 possible positions, the second white ball has w + r − 2, and so on, until the
drawing of the w-th white ball, which has r: (w + r − 1)(w + r − 2)…(r), which can be written in terms of factorials as:

(w + r − 1)! / (r − 1)!
Similarly, there are r! ways to arrange the drawing of the remaining r red balls. We know the total
number of balls is r + w, so there are (r + w)! total arrangements. Therefore, the probability is:
[(w + r − 1)! / (r − 1)!] · r! / (r + w)! = r / (w + r)
A more intuitive way to approach the problem is to consider just the last ball drawn. The probability
that the ball is red is simply the chance of it being red when picking randomly, which is the following:
r / (w + r)
Statistics
CHAPTER 6
Statistics is a core component of any data scientist's toolkit. Since many commercial layers
of a data science pipeline are built on statistical foundations (for example, A/B testing),
knowing the foundational topics of statistics is essential.
Interviewers love to test a candidate's knowledge of the basics of statistics, starting with
topics like the Central Limit Theorem and the Law of Large Numbers, and then progressing
on to the concepts underlying hypothesis-testing, particularly p-values and confidence
intervals, as well as Type I and Type II errors and their interpretations. All of those topics
play an important role in the statistical underpinning of A/B testing. Additionally, derivations
and manipulations involving random variables of various probability distributions are also
common, particularly in finance interviews. Lastly, a common topic in more technical
interviews will involve utilizing MLE and/or MAP.
The variance is always non-negative, and its square root is called the standard deviation, which is
heavily used in statistics.
σ = √Var(X) = √(E[(X − E[X])²]) = √(E[X²] − (E[X])²)
The conditional values of both the expectation and variance are as follows. For example, consider
the case for the conditional expectation of X, given that Y = y:
E[X | Y = y] = ∫_{−∞}^{∞} x f_{X|Y}(x | y) dx
For any given random variables X and Y, the covariance, a linear measure of relationship between the
two variables, is defined by the following:
Cov(X, Y) = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X]E[Y]
and the normalization of covariance, represented by the Greek letter ρ, is the correlation between X
and Y:
ρ(X, Y) = Cov(X, Y) / √(Var(X) Var(Y))
All of these properties are commonly tested in interviews, so it helps to be able to understand the
mathematical details behind each and walk through an example for each.
For example, if we assume X follows a Uniform distribution on the interval [a, b], then we have the
following:
f_X(x) = 1 / (b − a), for a ≤ x ≤ b
Therefore, the expectation of X is:

E(X) = ∫_a^b x f_X(x) dx = ∫_a^b x/(b − a) dx = x²/(2(b − a)) |_a^b = (a + b)/2
Although it is not necessary to memorize the derivations for all the different probability
distributions, you should be comfortable deriving them as needed, as it is a common request in
more technical interviews. To this end, you should make sure to understand the formulas given
above and be able to apply them to some of the common probability distributions like the
exponential or uniform distribution.
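One quick way to sanity-check a derivation like the one above is to compare the analytical result against a simulation (a minimal sketch; the endpoints a = 2 and b = 10 are arbitrary):

import numpy as np

a, b = 2.0, 10.0
samples = np.random.uniform(a, b, size=1_000_000)

print(samples.mean(), (a + b) / 2)        # both approximately 6.0
print(samples.var(), (b - a) ** 2 / 12)   # both approximately 5.33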
For example, by the Law of Large Numbers, over many coin flips we expect the proportion of heads to be approximately half of the total flips. Similarly, a casino might experience a loss on any
individual game, but over the long run should see a predictable profit.
Recall the PDF of a normally distributed random variable X:

f_X(x) = (1 / √(2πσ²)) exp[−(x − μ)² / (2σ²)]

with the mean and standard deviation given by μ and σ, respectively.
The CLT states that the sample mean X̄_n = (X₁ + … + X_n)/n converges (for large n) to N(μ, σ²/n); hence, (X̄_n − μ)/(σ/√n) converges to N(0, 1).
The CLT provides the basis for much of hypothesis testing, which is discussed shortly. At a very basic
level, you can consider the implications of this theorem on coin flipping: the probability of getting
some number of heads flipped over a large n should be approximately that of a normal distribution.
Whenever you're asked to reason about any particular distribution over a large sample size, you
should remember to think of the CLT, regardless of whether it is Binomial, Poisson, or any other
distribution.
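A short simulation sketch of the CLT in action, using heavily skewed exponential draws as the underlying (non-normal) distribution:

import numpy as np

rng = np.random.default_rng(0)

# 10,000 sample means, each computed from n = 100 exponential draws with mean 1
n = 100
sample_means = rng.exponential(scale=1.0, size=(10_000, n)).mean(axis=1)

# The CLT predicts these sample means are approximately N(1, 1/n)
print(sample_means.mean())   # approximately 1.0
print(sample_means.std())    # approximately 0.1, i.e., 1/sqrt(100)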
Hypothesis Testing
General Setup
The process of testing whether or not a sample of data supports a particular hypothesis is called
hypothesis testing. Generally, hypotheses concern particular properties of interest for a given
population, such as its parameters, like µ (for example, the mean conversion rate among a set of
users). The steps in testing a hypothesis are as follows:
1. State a null hypothesis and an alternative hypothesis. Either the null hypothesis will be rejected
(in favor of the alternative hypothesis), or it will fail to be rejected (although failing to reject the
null hypothesis does not necessarily mean it is true, but rather that there is not sufficient
evidence to reject it).
2. Use a particular test statistic of the null hypothesis to calculate the corresponding p-value.
3. Compare the p-value to a certain significance level α.
Since the null hypothesis typically represents a baseline (e.g., the marketing campaign did not
increase conversion rates, etc.), the goal is to reject the null hypothesis with statistical significance
and hope that there is a significant outcome.
Hypothesis tests are either one- or two-tailed tests. A one-tailed test has the following types of null
and alternative hypotheses:
H0: μ = μ0 versus H1: μ < μ0 or H1: μ > μ0

where H0 is the null hypothesis, H1 is the alternative hypothesis, and μ is the parameter of interest.
Understanding hypothesis testing is the basis of A/B testing, a topic commonly covered in tech
companies' interviews. In A/B testing, various versions of a feature are shown to a sample of
different users, and each variant is tested to determine if there was an uplift in the core engagement
metrics.
Say, for example, that you are working for Uber Eats, which wants to determine whether email
campaigns will increase its product's conversion rates. To conduct an appropriate hypothesis test,
you would need two roughly equal groups (equal with respect to dimensions like age, gender,
location, etc.). One group would receive the email campaigns and the other group would not be
exposed. The null hypothesis in this case would be that the two groups exhibit equal conversion
rates, and the hope is that the null hypothesis would be rejected.
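As a sketch of how such an A/B test might be evaluated numerically, here is a hypothetical two-proportion z-test (the conversion counts and group sizes are made up):

import math
from scipy import stats

# Hypothetical results: conversions / users in the control and email groups
x_control, n_control = 200, 5000
x_email, n_email = 260, 5000

p1, p2 = x_control / n_control, x_email / n_email
p_pool = (x_control + x_email) / (n_control + n_email)     # pooled proportion under H0
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_control + 1 / n_email))

z = (p2 - p1) / se
p_value = 2 * (1 - stats.norm.cdf(abs(z)))                  # two-sided p-value
print(z, p_value)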
Test Statistics
A test statistic is a numerical summary designed for the purpose of determining whether the null
hypothesis or the alternative hypothesis should be accepted as correct. More specifically, it assumes
that the parameter of interest follows a particular sampling distribution under the null hypothesis.
For example, the number of heads in a series of coin flips may be distributed as a binomial
distribution, but with a large enough sample size, the sampling distribution should be approximately
normally distributed. Hence, the sampling distribution for the total number of heads in a large series
of coin flips would be considered normally distributed.
Several variations in test statistics and their distributions include:
1. Z-test: assumes the test statistic follows a normal distribution under the null hypothesis
2. t-test: uses a Student's t-distribution rather than a normal distribution
3. Chi-squared: used to assess goodness of fit, and to check whether two categorical variables are
independent
Z-Test
Generally the Z-test is used when the sample size is large (to invoke the CLT) or when the population
variance is known, and a t-test is used when the sample size is small and when the population
variance is unknown. The Z-test for a population mean is formulated as:
z = (x̄ − μ0) / (σ/√n) ~ N(0, 1)
in the case where the population variance σ 2 is known.
t-Test
The t-test is structured similarly to the Z-test, but uses the sample variance s 2 in place of population
variance. The t-test is parametrized by the degrees of freedom, which refers to the number of
independent observations in a dataset, denoted below by n – 1 :
t = (x̄ − μ0) / (s/√n) ~ t_{n−1}

where

s² = ∑_{i=1}^{n} (x_i − x̄)² / (n − 1)
As stated earlier, the t-distribution is similar to the normal distribution in appearance but has larger
tails (i.e., extreme events happen with greater frequency than the modeled distribution would
predict), a common phenomenon, particularly in economics and Earth sciences.
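A minimal sketch of running a one-sample t-test with scipy (the sample data and the hypothesized mean of 5.0 are invented for illustration):

from scipy import stats

sample = [5.1, 4.8, 5.4, 5.0, 4.7, 5.3, 5.2, 4.9]

# Test H0: mu = 5.0 against the two-sided alternative
t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)
print(t_stat, p_value)   # reject H0 at alpha = 0.05 only if p_value < 0.05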
Chi-Squared Test
The Chi-squared test statistic is used to assess goodness of fit, and is calculated as follows:
χ² = ∑_i (O_i − E_i)² / E_i
where Oi is the observed value of interest and E i is its expected value. A Chi-squared test statistic
takes on a particular number of degrees of freedom, which is based on the number of categories in
the distribution.
To use the Chi-squared test to check whether two categorical variables are independent, create a table
of counts (called a contingency table), with the values of one variable forming the rows of the table
and the values of the other variable forming its columns, and compare the observed count in each cell
against the count expected under independence. This comparison uses the same style of Chi-squared
test statistic as given above.
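A brief sketch of this independence test using scipy's contingency-table helper (the 2 x 2 counts are hypothetical):

import numpy as np
from scipy import stats

# Hypothetical contingency table: rows = device (mobile, desktop),
# columns = outcome (converted, did not convert)
table = np.array([[120, 880],
                  [90, 910]])

chi2, p_value, dof, expected = stats.chi2_contingency(table)
print(chi2, p_value, dof)   # a small p_value suggests the two variables are not independent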
Both p-values and confidence intervals are commonly covered topics during interviews. Put simply, a
p-value is the probability of observing the value of the calculated test statistic under the null
hypothesis assumptions. Usually, the p-value is assessed relative to some predetermined level of
significance (0.05 is often chosen).
For example, a confidence interval for a population proportion, estimated by the sample proportion p̂ from n observations, is given by:

p̂ ± z_{α/2} √(p̂(1 − p̂)/n)

since our estimate of the true proportion will have the following parameters when approximated as
Gaussian:

μ = np/n = p,  σ² = np(1 − p)/n² = p(1 − p)/n
As long as the sampling distribution of a random variable is known, the appropriate p-values and
confidence intervals can be assessed.
Knowing how to explain p-values and confidence intervals, in both technical and nontechnical terms, is
very useful during interviews, so be sure to practice these. If asked about the technical details,
always remember to make sure you correctly identify the mean and variance at hand.
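A short sketch of computing such a confidence interval in Python (the 55 heads out of 100 flips are an invented example):

import math
from scipy import stats

heads, n = 55, 100
p_hat = heads / n

z = stats.norm.ppf(0.975)                         # z_{alpha/2} for a 95% interval (about 1.96)
margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
print(p_hat - margin, p_hat + margin)             # approximate 95% CI for the true proportion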
have the overall α of the 100 tests be 0.05, and this can be done by setting the new α to α/n, where
n is the number of hypothesis tests (in this case, α/n = 0.05/100 = 0.0005). This is known as the
Bonferroni correction, and using it helps ensure that the overall rate of false positives is
controlled within a multiple-testing framework.
Generally, most interview questions concerning Type I and II errors are qualitative in nature — for
instance, requesting explanations of terms or of how you would go about assessing errors/power in
an experimental setup.
The natural log of L(θ) is then taken prior to calculating the maximum; since log is a monotonically
increasing function, maximizing the log-likelihood log L(θ) is equivalent to maximizing the likelihood:

log L(θ) = ∑_{i=1}^{n} log f(x_i | θ)
Another way of fitting parameters is through maximum a posteriori estimation (MAP), which
assumes a "prior distribution:"
θ_MAP = arg max_θ g(θ) f(x₁, …, x_n | θ)

where the same log-likelihood trick is again employed, and g(θ) is the prior density function of θ.
Both MLE and MAP are especially relevant in statistics and machine learning, and knowing these is
recommended, especially for more technical interviews. For instance, a common question in such
interviews is to derive the MLE for a particular probability distribution. Thus, understanding the
above steps, along with the details of the relevant probability distributions, is crucial.
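As an illustration of the MLE idea in code, here is a sketch that numerically maximizes the log-likelihood of an exponential distribution with rate lambda, whose closed-form MLE is 1 divided by the sample mean (the simulated data and true rate of 0.5 are assumptions made for the example):

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=1000)   # true rate lambda = 1/2

def neg_log_likelihood(lam):
    # Exponential log-likelihood: n * log(lam) - lam * sum(x)
    return -(len(data) * np.log(lam) - lam * data.sum())

result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 10), method="bounded")
print(result.x, 1 / data.mean())   # numerical MLE vs. closed-form MLE (both about 0.5)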
Medium
6.11. Google: How would you derive a confidence interval for the probability of flipping heads
from a series of coin tosses?
6.12. Two Sigma: What is the expected number of coin flips needed to get two consecutive
heads?
6.13. Citadel: What is the expected number of rolls needed to see all six sides of a fair die?
6.14. Akuna Capital: Say you're rolling a fair six-sided die. What is the expected number of rolls
until you roll two consecutive 5s?
6.15. D.E. Shaw: A coin was flipped 1,000 times, and 550 times it showed heads. Do you think the
coin is biased? Why or why not?
6.16. Quora: You are drawing from a normally distributed random variable X ~ N(0, 1) once a day.
What is the approximate expected number of days until you get a value greater than 2?
6.17. Akuna Capital: Say you have two random variables X and Y, each with a standard deviation.
What is the variance of aX + bY for constants a and b?
6.18. Google: Say we have X ~ Uniform(0, 1) and Y ~ Uniform(0, 1) and the two are independent.
What is the expected value of the minimum of X and Y?
6.19. Morgan Stanley: Say you have an unfair coin which lands on heads 60% of the time. How
many coin flips are needed to detect that the coin is unfair?
6.20. Uber: Say you have n numbers 1...n, and you uniformly sample from this distribution with
replacement n times. What is the expected number of distinct values you would draw?
6.21. Goldman Sachs: There are 100 noodles in a bowl. At each step, you randomly select two
noodle ends from the bowl and tie them together. What is the expectation on the number
of loops formed?
6.22. Morgan Stanley: What is the expected value of the max of two dice rolls?
6.23. Lyft: Derive the mean and variance of the uniform distribution U(a, b).
6.24. Citadel: How many cards would you expect to draw from a standard deck before seeing the
first ace?
6.25. Spotify: Say you draw n samples from a uniform distribution U(a, b). What are the MLE
estimates of a and b?
Hard
6.26. Google: Assume you are drawing from an infinite set of i.i.d random variables that are
uniformly distributed from (0, 1). You keep drawing as long as the sequence you are getting
is monotonically increasing. What is the expected length of the sequence you draw?
6.27. Facebook: There are two games involving dice that you can play. In the first game, you roll
two dice at once and receive a dollar amount equivalent to the product of the rolls. In the
second game, you roll one die and get the dollar amount equivalent to the square of that
value. Which has the higher expected value and why?
6.28. Google: What does it mean for an estimator to be unbiased? What about consistent? Give
examples of an unbiased but not consistent estimator, and a biased but consistent
estimator.
6.29. Netflix: What are MLE and MAP? What is the difference between the two?
6.30. Uber: Say you are given a random Bernoulli trial generator. How would you generate values
from a standard normal distribution?
6.31. Facebook: Derive the expectation for a geometric random variable.
6.32. Goldman Sachs: Say we have a random variable X ~ D, where D is an arbitrary distribution.
What is the distribution F(X) where F is the CDF of X?
6.33. Morgan Stanley: Describe what a moment generating function (MGF) is. Derive the MGF for
a normally distributed random variable X.
6.34. Tesla: Say you have N independent and identically distributed draws of an exponential
random variable. What is the best estimator for the parameter λ?
6.35. Citadel: Assume that log X ~ N(0, 1). What is the expectation of X?
6.36. Google: Say you have two distinct subsets of a dataset for which you know their means and
standard deviations. How do you calculate the blended mean and standard deviation of the
total dataset? Can you extend it to K subsets?
6.37. Two Sigma: Say we have two random variables X and Y. What does it mean for X and Y to be
independent? What about uncorrelated? Give an example where X and Y are uncorrelated
but not independent.
6.38. Citadel: Say we have X ~ Uniform(-1, 1) and Y = X². What is the covariance of X and Y?
6.39. Lyft: How do you uniformly sample points at random from a circle with radius R?
6.40. Two Sigma: Say you continually sample from some i.i.d. uniformly distributed (0, 1) random
variables until the sum of the variables exceeds 1. How many samples do you expect to
make?
sufficiently large; we can assess the statistical properties of the total number of bookings, as well as
the booking rate (rides booked / rides opened on app). These statistical properties play a key role in
hypothesis testing, allowing companies like Uber to decide whether or not to add new features in a
data-driven manner.
Solution #6.2
Suppose we want to estimate some parameters of a population. For example, we might want to
estimate the average height of males in the U.S. Given some data from a sample, we can compute a
sample mean for what we think the value is, as well as a range of values around that mean.
Following the previous example, we could obtain the heights of 1,000 random males in the U.S. and
compute the average height, or the sample mean. This sample mean is a type of point estimate and,
while useful, will vary from sample to sample. Thus, we can't tell anything about the variation in the
data around this estimate, which is why we need a range of values through a confidence interval.
Confidence intervals are a range of values with a lower and an upper bound such that if you were to
sample the parameter of interest a large number of times, the 95% confidence interval would
contain the true value of this parameter 95% of the time. We can construct a confidence interval
using the sample standard deviation and sample mean. The level of confidence is determined by a
margin of error that is set beforehand. The narrower the confidence interval, the more precise the
estimate, since there is less uncertainty associated with the point estimate of the mean.
Solution #6.3
A/B testing has many possible pitfalls that depend on the particular experiment and setup
employed. One common drawback is that groups may not be balanced, possibly resulting in highly
skewed results. Note that balance is needed for all dimensions of the groups — like user
demographics or device used — because, otherwise, the potentially statistically significant results
from the test may simply be due to specific factors that were not controlled for. Two types of errors
are frequently assessed: Type I error, which is also known as a "false positive," and Type II error, also
known as a "false negative." Specifically, Type error is rejecting a null hypothesis when that
hypothesis is correct, whereas Type II error is failing to reject a null hypothesis when its alternative
hypothesis is correct.
Another common pitfall is not running an experiment for long enough. Generally speaking,
experiments are run with a particular power threshold and significance threshold; however, they
often do not stop immediately upon detecting an effect. For an extreme example, assume you're at
either Uber or Lyft and running a test for two days, when the metric of interest (e.g., rides booked) is
subject to weekly seasonality.
Lastly, dealing with multiple tests is important because there may be interactions between results of
tests you are running and so attributing results may be difficult. In addition, as the number of
variations you run increases, so does the sample size needed. In practice, while it seems technically
feasible to test 1,000 variations of a button when optimizing for click-through rate, variations in tests
are usually based on some intuitive hypothesis concerning core behavior.
Solution #6.4
For any given random variables X and Y, the covariance, a linear measure of relationship, is defined
by the following: Cov(X, Y) = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X]E[Y]
Specifically, covariance indicates the direction of the linear relationship between X and Y and can
take on any potential value from negative infinity to infinity. The units of covariance are based on the
units of X and Y, which may differ.
The correlation between X and Y is the normalized version of covariance that takes into account the
variances of X and Y:
ρ(X, Y) = Cov(X, Y) / √(Var(X) Var(Y))
Since correlation results from scaling covariance, it is dimensionless (unlike covariance) and is always
between -1 and 1 (also unlike covariance).
Solution #6.5
The null hypothesis is that the coin is fair, and the alternative hypothesis is that the coin is biased
towards tails (note this is a one-sided test):
H0 : P0 = 0.5, H1 : P1 < 0.5
Note that, since the sample size here is 10, you cannot apply the Central Limit Theorem (and so you
cannot approximate a binomial using a normal distribution).
The p-value here is the probability of observing the results obtained given that the null hypothesis is
true, i.e., under the assumption that the coin is fair. In total, for 10 flips of a coin, there are 2^10 =
1024 possible outcomes, and in only 10 of them are there 9 tails and one heads. Hence, the exact
probability of the given result is the p-value, which is 10/1024 ≈ 0.0098. Therefore, with a significance
level set, for example, at 0.05, we can reject the null hypothesis.
Solution #6.6
The process of testing whether data supports particular hypotheses is called hypothesis testing and
involves measuring parameters of a population's probability distribution. This process typically
employs at least two groups: one a control that receives no treatment, and the other group(s), which
do receive the treatment(s) of interest. Examples could be the height of two groups of people, the
conversion rates for particular user flows in a product, etc. Testing also involves two hypotheses —
the null hypothesis, which assumes no significant difference between the groups, and the alternative
hypothesis, which assumes a significant difference in the measured parameter(s) as a consequence
of the treatment.
A p-value is the probability of observing the given test results under the null hypothesis
assumptions. The lower this probability, the higher the chance that the null hypothesis should be
rejected. If the p-value is lower than the predetermined significance level α, generally set at 0.05,
then it indicates that the null hypothesis should be rejected in favor of the alternative hypothesis.
Otherwise, the null hypothesis cannot be rejected, and it cannot be concluded that the treatment
has any significant effect.
Solution #6.7
Both errors are relevant in the context of hypothesis testing. Type I error is when one rejects the null
hypothesis when it is correct, and is known as a false positive. Type II error is when the null
hypothesis is not rejected when the alternative hypothesis is correct; this is known as a false
negative. In layman's terms, a Type I error is when we detect a difference when, in reality, there is no
significant difference in an experiment. Similarly, a type II error occurs when we fail to detect a
difference, when in reality there is a significant difference in an experiment.
The Type I error rate is given by the level of significance α, whereas the Type II error rate is given by β.
Usually, 1 − α is referred to as the confidence level, whereas 1 − β is referred to as the statistical power of the test
being conducted. Note that, in any well-conducted statistical procedure, we want both α and β to
be small. However, based on the definitions of the two, it is impossible to make both errors arbitrarily small
simultaneously: the larger α is, the smaller β is. Based on the experiment and the relative
importance of false positives and false negatives, a data scientist must decide what thresholds to
adopt for any given experiment. Note that experiments are typically set up so as to have both 1 − α and 1 − β
relatively high (say, at 0.95 and 0.8, respectively).
Solution #6.8
Power is the probability of rejecting the null hypothesis when, in fact, it is false. It is also the
probability of avoiding a Type II error. A Type II error occurs when the null hypothesis is not rejected
when the alternative hypothesis is correct. This is important because we want to detect significant
effects during experiments. That is, the higher the statistical power of the test, the higher the
probability of detecting a genuine effect (i.e., accepting the alternative hypothesis and rejecting the
null hypothesis). A minimum sample size can be calculated for any given level of power — for
example, say a power level of 0.8.
An analysis of the statistical power of a test is usually performed with respect to the test's level of
significance (α) and the effect size (i.e., the magnitude of the results).
Solution #6.9
In a Z-test, your test statistic follows a normal distribution under the null hypothesis. Alternatively, in
a t-test, you employ a Student's t-distribution rather than a normal distribution as your sampling
distribution.
Considering the population mean, we can use either a Z-test or a t-test only if the mean is normally
distributed, which is possible in two cases: the initial population is normally distributed, or the
sample size is large enough (n ≥ 30) that we can apply the Central Limit Theorem.
If the condition above is satisfied, then we need to decide which type of test is more appropriate to
use. In general, we use a Z-test if the population variance is known, and a t-test if the population
variance is unknown.
Additionally, if the sample size is very large (n > 200), we can use the Z-test in either case, since for such
large degrees of freedom the t-distribution is practically indistinguishable from the z-distribution.
Considering the population proportion, we can use a Z-test (but not t-lest) where np0 10 and n(1-
P0) 10, i.e., when each of the number of successes and the number of failures is at least 10.
Solution #6.10
The primary consideration is that, as the number of tests increases, the chance that a stand-alone p-value for any of the t-tests is statistically significant due to chance alone becomes very high. As an example, with 100 tests performed and a significance threshold of α = 0.05, you would expect five of the experiments to be statistically significant due only to chance. That is, you have a very high probability of observing at least one significant outcome. Therefore, the chance of incorrectly rejecting a null hypothesis (i.e., committing a Type I error) increases.
To correct for this effect, we can use a method called the Bonferroni correction, wherein we set the significance threshold to α/m, where m is the number of tests being performed. In the above scenario with 100 tests, we would set the significance threshold to 0.05/100 = 0.0005.
While this correction helps to protect against Type I error, it is still prone to Type II error (i.e., failing to reject the null hypothesis when it should be rejected). In general, the Bonferroni correction is most useful when there is a smaller number of multiple comparisons of which a few are significant. If the number of tests becomes sufficiently high such that many tests yield statistically significant results, the number of Type II errors may also increase significantly.
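As an added illustration, the correction simply compares each p-value to α/m; the p-values in this sketch are made up:

# Sketch: applying the Bonferroni correction to a list of p-values.
alpha = 0.05
p_values = [0.001, 0.02, 0.04, 0.30, 0.65]

m = len(p_values)
bonferroni_threshold = alpha / m  # compare each p-value to alpha / m

for i, p in enumerate(p_values, start=1):
    significant = p < bonferroni_threshold
    print(f"Test {i}: p = {p:.3f} -> significant: {significant}")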
Solution #6.11
The confidence interval (CI) for a population proportion is an interval that includes the true population proportion with a certain degree of confidence 1 − α.
For the case of flipping heads in a series of coin tosses, the proportion follows the binomial distribution. If the series size is large enough (each of the number of successes and the number of failures is at least 10), we can utilize the Central Limit Theorem and use the normal approximation for the binomial distribution, so that the proportion of heads is approximately distributed as:

N\left(\hat{p}, \frac{\hat{p}(1-\hat{p})}{n}\right)

where \hat{p} is the proportion of heads tossed in the series, and n is the series size. The CI is centered at the series proportion, plus or minus a margin of error:

\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}

where z_{\alpha/2} is the appropriate value from the standard normal distribution for the desired confidence level.
For example, for the most commonly used level of confidence, 95%, z_{\alpha/2} = 1.96.
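For instance, a quick sketch of this interval in Python, using made-up counts of 560 heads out of 1,000 tosses:

# Sketch: 95% confidence interval for the proportion of heads.
import math

heads, n = 560, 1000
p_hat = heads / n
z = 1.96  # z-value for a 95% confidence level

margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
print(f"95% CI: ({p_hat - margin:.3f}, {p_hat + margin:.3f})")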
Solution #6.12
Let X be the number of coin flips needed to obtain two consecutive heads. We then want to solve for E[X]. Let H denote a flip that results in heads, and T denote a flip that results in tails. Note that E[X] can be written in terms of E[X|H] and E[X|T], i.e., the expected number of additional flips needed, conditioned on the most recent flip being heads or tails, respectively.

Conditioning on the first flip, we have: E[X] = \frac{1}{2}(1 + E[X|H]) + \frac{1}{2}(1 + E[X|T])

Note that E[X|T] = E[X], since if a tail is flipped, we need to start over in getting two heads in a row.
To solve for E[X|H], we can condition it further on the next outcome: either heads (HH) or tails (HT).

Therefore, we have: E[X|H] = \frac{1}{2}(1 + E[X|HH]) + \frac{1}{2}(1 + E[X|HT])

Note that if the result is HH, then E[X|HH] = 0, since the outcome has been achieved. If a tail was flipped, then E[X|HT] = E[X], and we need to start over in attempting to get two heads in a row. Thus:

E[X|H] = \frac{1}{2}(1 + 0) + \frac{1}{2}(1 + E[X]) = 1 + \frac{1}{2}E[X]

Plugging this into the original equation yields:

E[X] = \frac{1}{2}\left(1 + 1 + \frac{1}{2}E[X]\right) + \frac{1}{2}(1 + E[X])

and after solving we get E[X] = 6. Therefore, we would expect 6 flips.
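A short Monte Carlo sketch (an addition for illustration) confirms the answer of 6:

# Sketch: simulate the expected number of flips until two consecutive heads.
import random

def flips_until_two_heads() -> int:
    flips, streak = 0, 0
    while streak < 2:
        flips += 1
        streak = streak + 1 if random.random() < 0.5 else 0
    return flips

trials = 100_000
print(sum(flips_until_two_heads() for _ in range(trials)) / trials)  # ~6.0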
Solution #6.13
Let k denote the number of distinct sides seen from the rolls. The first roll will always result in a new side being seen. If you have seen k sides, where k < 6, then the probability of rolling an unseen value will be (6 − k)/6, since there are 6 − k values you have not seen, and 6 possible outcomes of each roll.
Note that each roll is independent of previous rolls. Therefore, for the second roll (k = 1), the time until an unseen side appears has a geometric distribution with p = 5/6, since there are five of the six sides left to be seen. Likewise, after two sides have been seen (k = 2), the time taken follows a geometric distribution with p = 4/6. This continues until all sides have been seen.
Recall that the mean of a geometric distribution is given by 1/p, and let X be the number of rolls needed to show all six sides. Then, we have the following:

E[X] = 1 + \frac{6}{5} + \frac{6}{4} + \frac{6}{3} + \frac{6}{2} + \frac{6}{1} = 6\sum_{k=1}^{6}\frac{1}{k} \approx 14.7 \text{ rolls}
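A quick sketch comparing the analytical sum with a simulation (an addition, not part of the original solution):

# Sketch: expected number of rolls needed to see all six sides of a die.
import random

analytical = sum(6 / k for k in range(1, 7))  # 6 * (1 + 1/2 + ... + 1/6) = 14.7

def rolls_to_see_all_sides() -> int:
    seen, rolls = set(), 0
    while len(seen) < 6:
        seen.add(random.randint(1, 6))
        rolls += 1
    return rolls

trials = 100_000
simulated = sum(rolls_to_see_all_sides() for _ in range(trials)) / trials
print(analytical, simulated)  # both ~14.7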
Solution #6.14
Similar in methodology to question 13, let X be the number of rolls until two consecutive fives. Let Y denote the event that a five was just rolled.
Conditioning on the first roll, either we rolled a five (with probability 1/6), in which case we only need one more five, or we rolled some other number (with probability 5/6) and need to start over after having used one roll:

E[X] = \frac{1}{6}(1 + E[X|Y]) + \frac{5}{6}(1 + E[X])

Note that we also have: E[X|Y] = \frac{1}{6}(1) + \frac{5}{6}(1 + E[X])

Plugging the second equation into the first and solving yields an expected value of E[X] = 42 rolls.
Solution #6.15
Because the sample size of flips is large (1,000), we can apply the Central Limit Theorem. Since each individual flip is a Bernoulli random variable, let p be the probability of getting heads. We want to test whether p = 0.5 (i.e., whether the coin is fair or not). The Central Limit Theorem allows us to approximate the total number of heads seen as being normally distributed.
More specifically, the number of heads seen out of n total flips follows a binomial distribution, since it is a sum of Bernoulli random variables. If the coin is not biased (p = 0.5), then the expected number of heads is µ = np = 1000 × 0.5 = 500, and the variance of the number of heads is given by:

\sigma^2 = np(1 - p) = 1000 \times 0.5 \times 0.5 = 250, \quad \sigma = \sqrt{250} \approx 16

Since this mean and standard deviation specify the normal distribution, we can calculate the corresponding z-score for 550 heads as follows:

z = \frac{550 - 500}{16} \approx 3.16
This means that, if the coin were fair, the event of seeing 550 heads should occur with a < 0.1%
chance under normality assumptions. Therefore, the coin is likely biased.
Solution #6.16
Since X is normally distributed, we can employ the cumulative distribution function (CDF) of the normal distribution: \Phi(2) = P(X \leq 2) = P(X \leq \mu + 2\sigma) = 0.9772
Therefore, P(X > 2) = 1 − 0.9772 = 0.0228 for any given day. Since each day's draws are independent, the expected time until drawing an X > 2 follows a geometric distribution with p = 0.0228. Letting T be a random variable denoting the number of days, we have the following:

E[T] = \frac{1}{p} = \frac{1}{0.0228} \approx 44 \text{ days}
Solution #6.17
Let the variances for X and Y be denoted by Var (X) and Var (Y).
Then, recalling that the variance of a sum of random variables is expressed as follows:

Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y)

and that a constant coefficient of a random variable scales the variance as Var(aX) = a^2 Var(X),

we have Var(aX + bY) = a^2 Var(X) + b^2 Var(Y) + 2ab\,Cov(X, Y), which provides the bounds on the designated variance; the range will depend on the covariance between X and Y.
Solution #6.18
Let Z = min(X, Y). Then we know the following: P(Z \leq z) = P(\min(X, Y) \leq z) = 1 - P(X > z, Y > z)
For a uniform distribution, the following is true for a value of z between 0 and 1:
P(X > z) = 1 - z and P(Y > z) = 1 - z
Since X and Y are i.i.d., this yields: P(Z \leq z) = 1 - P(X > z, Y > z) = 1 - (1 - z)^2
Now we have the cumulative distribution function for Z. We can get the probability density function by taking the derivative of the CDF to obtain f_Z(z) = 2(1 - z). Then, solving for the expected value by taking the integral yields the following:

E[Z] = \int_0^1 z f_Z(z)\,dz = 2\int_0^1 z(1 - z)\,dz = 2\left(\frac{1}{2} - \frac{1}{3}\right) = \frac{1}{3}

Therefore, the expected value of the minimum of X and Y is 1/3.
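A tiny simulation sketch (added for illustration) agrees with the 1/3 result:

# Sketch: simulate E[min(X, Y)] for X, Y ~ U(0, 1).
import random

trials = 100_000
total = sum(min(random.random(), random.random()) for _ in range(trials))
print(total / trials)  # ~0.333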
Solution #6.19
Say we flip the unfair coin n times. Each flip is a Bernoulli trial with a success probability of p:

x_1, x_2, \ldots, x_n, \quad x_i \sim Ber(p)

We can construct a confidence interval for p as follows, using the Central Limit Theorem. First, we decide on our level of confidence. If we select a 95% confidence level, the necessary z-score is z = 1.96. We then construct a 95% confidence interval for p. If the interval does not contain 0.5, then we can reject the null hypothesis that the coin is fair.
Since the trials are i.i.d., we can compute the sample proportion \hat{p} from a large number of trials:

\hat{p} = \frac{1}{n}\sum_{i=1}^{n} x_i

We know the following properties hold: E[\hat{p}] = \frac{np}{n} = p \quad \text{and} \quad Var(\hat{p}) = \frac{np(1-p)}{n^2} = \frac{p(1-p)}{n}
Solution #6.20
Let the following be an indicator random variable: X_i = 1 if the number i is drawn in n turns.
We would then want to find the following:

\sum_{i=1}^{n} E[X_i]

We know that P(X_i = 1) = 1 - P(X_i = 0), and since each draw is independent, the probability of a number not being drawn in any of the n turns is:

P(X_i = 0) = \left(\frac{n-1}{n}\right)^n

Therefore, we have P(X_i = 1) = 1 - \left(\frac{n-1}{n}\right)^n, and by linearity of expectation, we then have:

\sum_{i=1}^{n} E[X_i] = nE[X_i] = n\left(1 - \left(\frac{n-1}{n}\right)^n\right)
Solution #6.21
Say that we have n noodles. At any given step, we will have one of two outcomes: (1) we pick two ends from the same noodle (which makes a loop), or (2) we pick two ends from different noodles.
Let X_n denote a random variable representing the number of loops formed when n noodles remain.
The probability of case (1) happening is:

\frac{n}{\binom{2n}{2}} = \frac{1}{2n-1}

where the denominator represents the number of ways to choose two ends from the noodles, and the numerator represents the number of ways to choose both ends of the same noodle.
Therefore, the probability of case (2) happening is: 1 - \frac{1}{2n-1} = \frac{2n-2}{2n-1}
In either case, one fewer noodle remains afterward (case 1 removes a noodle as a completed loop; case 2 merges two noodles into one). Taking cases (1) and (2) together, we have the following recursive formulation for the expected number of loops formed:

E[X_n] = \frac{1}{2n-1}\left(1 + E[X_{n-1}]\right) + \frac{2n-2}{2n-1}E[X_{n-1}] = \frac{1}{2n-1} + E[X_{n-1}]

Plugging in E[X_1] = 1 and calculating the first few terms, we notice the following pattern, into which we can plug n = 100 to obtain the answer:

E[X_{100}] = 1 + \frac{1}{3} + \frac{1}{5} + \ldots + \frac{1}{2(100)-1} \approx 3.3
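Evaluating the sum numerically (an added sketch) gives roughly 3.3:

# Sketch: expected number of loops formed from n = 100 noodles.
n = 100
expected_loops = sum(1 / (2 * k - 1) for k in range(1, n + 1))
print(expected_loops)  # ~3.28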
Solution #6.22
Since we only have two dice, let Y denote the maximum value between the two. Let X_1 and X_2 denote the first and second rolls, with Y = max(X_1, X_2). Then we want to find the following:

E[Y] = \sum_{i=1}^{6} i \cdot P(Y = i)

We can condition Y = i on three cases: (1) die one is the unique maximum; (2) die two is the unique maximum; or (3) they are both the same.
For cases (1) and (2) we have:

P(X_1 = i, X_2 < i) = P(X_2 = i, X_1 < i) = \frac{1}{6}\cdot\frac{i-1}{6}

For case (3), where both dice show the maximum:

P(X_1 = X_2 = i) = \frac{1}{6}\cdot\frac{1}{6}

Putting everything together yields the following:

E[Y] = \sum_{i=1}^{6} i\left(2\cdot\frac{1}{6}\cdot\frac{i-1}{6} + \frac{1}{36}\right) = \frac{161}{36} \approx 4.47
A simpler way to visualize this is to use a contingency table, such as the one below:
1 2 3 4 5 6
(1, 1) (1, 2) (1, 3) (1, 4) (1, 5) (1, 6)
(2, 1) (2, 2) (2, 3) (2, 4) (2, 5) (2, 6)
(3, 1) (3, 2) (3, 3) (3, 4) (3, 5) (3, 6)
(4, 1) (4, 2) (4, 3) (4, 4) (4, 5) (4, 6)
(5, 1) (5, 2) (5, 3) (5, 4) (5, 5) (5, 6)
(6, 1) (6, 2) (6, 3) (6, 4) (6, 5) (6, 6)
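An added sketch that enumerates the same 36 outcomes programmatically:

# Sketch: enumerate all 36 equally likely outcomes to confirm E[max] = 161/36.
outcomes = [(i, j) for i in range(1, 7) for j in range(1, 7)]
expected_max = sum(max(i, j) for i, j in outcomes) / len(outcomes)
print(expected_max, 161 / 36)  # both ~4.47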
Solution #6.23
For X ~ U(a, b), we have the following: f_X(x) = \frac{1}{b-a}
Therefore, we can calculate the mean as:

E[X] = \int_a^b x f_X(x)\,dx = \int_a^b \frac{x}{b-a}\,dx = \left.\frac{x^2}{2(b-a)}\right|_a^b = \frac{a+b}{2}

Similarly, for the second moment:

E[X^2] = \int_a^b x^2 f_X(x)\,dx = \int_a^b \frac{x^2}{b-a}\,dx = \left.\frac{x^3}{3(b-a)}\right|_a^b = \frac{a^2+ab+b^2}{3}

Therefore:

Var(X) = \frac{a^2+ab+b^2}{3} - \left(\frac{a+b}{2}\right)^2 = \frac{(b-a)^2}{12}
Solution #6.24
Although one can enumerate all the probabilities, this can get a bit messy from an algebraic standpoint, so the following intuitive approach is preferable. Imagine we have aces A1,
A2, A3, A4. We can then draw a line in between them to represent an arbitrary number (including 0)
of cards between each ace, with a line before the first ace and after the last.
|A1|A2|A3|A4|
There are 52 - 4 = 48 non-ace cards in a deck. Each of these cards is equally likely to be in any of the
five lines. Therefore, there should be 48/5 = 9.6 cards drawn, on average, prior to the first ace. Hence, the expected number of cards drawn until the first ace is seen is 9.6 + 1 = 10.6 cards; we can't forget to add 1, because we need to include drawing the ace card itself.
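An added simulation sketch confirms the 10.6 figure:

# Sketch: simulate the number of cards drawn (inclusive) until the first ace.
import random

def draws_until_first_ace() -> int:
    deck = ["A"] * 4 + ["x"] * 48  # 4 aces, 48 non-aces
    random.shuffle(deck)
    return deck.index("A") + 1  # +1 to include the ace itself

trials = 100_000
print(sum(draws_until_first_ace() for _ in range(trials)) / trials)  # ~10.6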
Solution #6.25
Note that for a uniform distribution, the probability density is \frac{1}{b-a} for any value on the interval [a, b]. The likelihood function is therefore as follows:

f(x_1, \ldots, x_n \mid a, b) = \left(\frac{1}{b-a}\right)^n

To obtain the MLE, we maximize this likelihood function, which is clearly maximized when b − a is as small as possible while still containing every sample, i.e., when b is the largest of the samples and a is the smallest of the samples. Therefore, we have the following:

\hat{a} = \min(x_1, \ldots, x_n), \quad \hat{b} = \max(x_1, \ldots, x_n)
Solution #6.26
Assume that we have an indicator random variable: X_i = 1 if the sequence is increasing up to the i-th element, and X_i = 0 otherwise.
Then, we calculate the expectation E[X_1 + X_2 + \ldots]. Consider some arbitrary i. In order to draw up to element i, the entire sequence up to i must be monotonically increasing, i.e., the first i draws must arrive in sorted order. Since all i! orderings of the first i draws are equally likely, there is a \frac{1}{i!} chance of X_i being 1. By linearity of expectation, we then have:

E[X_1 + X_2 + \ldots] = 1 + \frac{1}{2!} + \frac{1}{3!} + \ldots = e - 1
Solution #6.27
One method of solving this problem is brute force, which consists of computing the expected values by listing all of the outcomes and their associated probabilities and payoffs. However, there exists an easier way of solving the problem.
Assume that the outcome of the roll of a die is given by a random variable X (meaning that it takes on the values 1 through 6 with equal probability). Then, the question is equivalent to asking: how does E[X] · E[X] = E[X]^2 (i.e., the expected value of the product of two separate rolls) compare with E[X^2] (the expected value of the square of a single roll)?
Recall that the variance of a given random variable X is as follows:

Var(X) = E[X^2] - E[X]^2 \geq 0

Notice that this variance term is exactly the difference between the two "games" (the expected payoff of the second game minus the expected payoff of the first game). Since the left-hand side is positive (a die roll is not constant, so its variance is strictly greater than zero), the right-hand side is also positive. Therefore, it must be the case that the second game has a higher expected value than the first.
Solution #6.28
In both cases, we are dealing with an estimator of the true parameter value. An estimator is
unbiased if the expectation of the estimator is the true underlying parameter value. An estimator is
consistent if, as the sample size increases, the estimator's sampling distribution converges towards
the true parameter value.
Consider the following random variable X, which is normally distributed, and n i.i.d. samples used to calculate a sample mean:

X \sim N(\mu, \sigma^2) \quad \text{and} \quad \bar{x} = \frac{x_1 + x_2 + \ldots + x_n}{n}

The first sample is an example of an unbiased but not consistent estimator. It is unbiased since E[x_1] = \mu. However, it is not consistent since, as the sample size increases, the sampling distribution of the first sample does not become more concentrated around the true mean.
An example of a biased but consistent estimator is the sample variance:

S_n^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2

It can be shown that E[S_n^2] = \frac{n-1}{n}\sigma^2
The formal proof of the above is called Bessel's correction, but there is an intuitive way to grasp the
presence of the term preceding the variance. If we uniformly sample two numbers randomly from
the series of numbers 1 to n, we have an n/n² = 1/n chance that the two equal the same number,
meaning the sampled squared difference of the numbers will be zero. The sample variance will
therefore slightly underestimate the true variance. However, this bias goes to 0 as n approaches
infinity, since the term in front of the variance, (n−1)/n, approaches 1. Therefore, the estimator is
consistent.
Solution #6.29
MLE stands for maximum likelihood estimation, and MAP for maximum a posteriori. Both are ways
of estimating variables in a probability distribution by producing a single estimate of that variable.
Assume that we have a likelihood function P(X|θ). Given n i.i.d. samples, the MLE is as follows:

MLE(\theta) = \max_{\theta} P(X \mid \theta) = \max_{\theta} \prod_{i}^{n} P(x_i \mid \theta)

Since the product of multiple numbers all valued between 0 and 1 might be very small, maximizing the log of the product above is more convenient. This is an equivalent problem, since the log function is monotonically increasing. Since the log of a product is equivalent to the sum of logs, the MLE becomes the following:

MLE_{\log}(\theta) = \max_{\theta} \sum_{i=1}^{n} \log P(x_i \mid \theta)

Relying on Bayes' rule, MAP uses the posterior P(θ|X), which is proportional to the likelihood multiplied by a prior P(θ), i.e., P(θ|X) ∝ P(X|θ)P(θ). The MAP for θ is thus the following:

MAP(\theta) = \max_{\theta} P(X \mid \theta)P(\theta) = \max_{\theta} \prod_{i}^{n} P(x_i \mid \theta)P(\theta)

Employing the same math as used in calculating the MLE, the MAP becomes:

MAP_{\log}(\theta) = \max_{\theta} \sum_{i=1}^{n} \log P(x_i \mid \theta) + \log P(\theta)
Therefore, the only difference between the MLE and MAP is the inclusion of the prior in MAP;
otherwise, the two are identical. Moreover, MLE can be seen as a special case of the MAP with a
uniform prior.
Solution #6.30
Assume we have n Bernoulli trials, each with a probability p of success. Altogether, they form a binomial distribution: x_1, x_2, \ldots, x_n with X \sim B(n, p), where x_i = 1 means success and x_i = 0 means failure. Assuming i.i.d. trials, we can compute the sample proportion \hat{p} as follows:

\hat{p} = \frac{1}{n}\sum_{i=1}^{n} x_i

We know that if n is large enough, then the binomial distribution approximates the following normal distribution:

\hat{p} \approx N\left(p, \frac{p(1-p)}{n}\right)

where n must satisfy np \geq 10 and n(1-p) \geq 10.
Therefore, the value \hat{p} can be used as a simulation of a normal distribution. The sample size n must only be large enough to satisfy the conditions above (at least n = 20 for p = 0.5), but it is recommended to use a significantly larger n to get a better normal approximation.
Finally, to simulate the standard normal distribution, we normalize \hat{p}:

\hat{p}_0 = \frac{\hat{p} - p}{\sqrt{\frac{p(1-p)}{n}}}

At this point, we can derive the final formula for our normal random generator:

x = \frac{\frac{1}{n}\sum_{i=1}^{n} x_i - p}{\sqrt{\frac{p(1-p)}{n}}}
Solution #6.31
We are seeking the expected value of a geometric random variable X as follows:

E[X] = \sum_{k=1}^{\infty} k f_X(k)

The expression above contains a summation instead of an integral since k is a discrete rather than continuous random variable, and we know the probability mass function of the geometric distribution is given by: f_X(k) = (1-p)^{k-1}p
Note that the sum over k(1-p)^{k-1} can be rewritten as a sum of geometric tail sums:

\sum_{k=1}^{\infty} k(1-p)^{k-1} = \sum_{k=1}^{\infty}(1-p)^{k-1} + \sum_{k=2}^{\infty}(1-p)^{k-1} + \sum_{k=3}^{\infty}(1-p)^{k-1} + \ldots = \frac{1}{p} + \frac{1-p}{p} + \frac{(1-p)^2}{p} + \ldots = \frac{1}{p^2}

Plugging this back into the equation for the expected value of X yields the following:

E[X] = p \cdot \frac{1}{p^2} = \frac{1}{p}

Solution #6.32
We can define a new variable Y = F(X), and, hence, we want to find the CDF of Y (where y is between 0 and 1 by definition of a CDF): F_Y(y) = P(Y \leq y)
Substituting in for Y yields the following: F_Y(y) = P(F(X) \leq y)
Applying the inverse CDF on both sides yields the following: F_Y(y) = P(X \leq F^{-1}(y)) = F(F^{-1}(y)) = y
This is exactly the CDF of a uniform distribution on [0, 1], so Y = F(X) is uniformly distributed on [0, 1].
Solution #6.33
A moment generating function is the following function for a given random variable:
M_X(s) = E[e^{sX}]

If X is continuous (as in the case of normal distributions), then the function becomes the following:

M_X(s) = \int_{-\infty}^{\infty} e^{sx} f_X(x)\,dx

Hence, the moment generating function is a function of s. It is useful for calculating moments, since taking derivatives of the moment generating function and evaluating them at s = 0 yields the desired moments.
For a normal distribution, recall that:

f_X(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}

First, taking the special case of the standard normal random variable, we have the following:

f_X(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}x^2}

Plugging this into the above MGF yields:

M_X(s) = \int_{-\infty}^{\infty} e^{sx}\frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}x^2}\,dx = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-\frac{1}{2}x^2 + sx}\,dx

Completing the square yields:

M_X(s) = e^{\frac{s^2}{2}}\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-\frac{1}{2}x^2 + sx - \frac{s^2}{2}}\,dx = e^{\frac{s^2}{2}}\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-\frac{(x-s)^2}{2}}\,dx = e^{\frac{s^2}{2}}

Note that the last step uses the fact that the expression within the integral is the PDF of a normally distributed random variable with mean s and variance 1, and hence the integral evaluates to 1.
To solve for a general normal random variable, you can plug in X = \sigma Y + \mu, where Y is standard normal, which yields M_X(s) = e^{\mu s + \frac{\sigma^2 s^2}{2}}.
Solution #6.34
Denote the n i.i.d. draws as x_1, x_2, \ldots, x_n, where, for any individual draw, we have the pdf:

f_X(x_i) = \lambda e^{-\lambda x_i}

The likelihood is therefore:

L(\lambda; x_1, \ldots, x_n) = \prod_{i=1}^{n} f_X(x_i) = \lambda^n \exp\left(-\lambda\sum_{i=1}^{n} x_i\right)

Taking the log of the equation above to obtain the log-likelihood results in the following:

\log L(\lambda; x_1, \ldots, x_n) = n\log\lambda - \lambda\sum_{i=1}^{n} x_i

Taking the derivative with respect to \lambda and setting the result to 0 yields:

\frac{n}{\lambda} - \sum_{i=1}^{n} x_i = 0

Therefore, the best estimate of \lambda is given by:

\hat{\lambda} = \frac{n}{\sum_{i=1}^{n} x_i}
Solution #6.35
Define Y = log X. We then want to solve for E[X] = E[e^Y].
Recall that a moment generating function has the following form: M_Y(s) = E[e^{sY}]
Therefore, we want the moment generating function of Y ~ N(0, 1), which was derived in Solution #6.33: M_Y(s) = e^{\frac{s^2}{2}}. Evaluating it at s = 1 gives E[X] = E[e^Y] = e^{\frac{1}{2}}.
Solution #6.36
Say that the two groups have two distinct sizes: n_1 = size of group 1, and n_2 = size of group 2.
Given the means of the two groups, \mu_1 and \mu_2, the blended mean can be found simply by taking a weighted average:

\mu = \frac{n_1\mu_1 + n_2\mu_2}{n_1 + n_2}

We know that the blended standard deviation for the total data set has the form:

s = \sqrt{\frac{\sum_{i=1}^{n_1+n_2}(z_i - \mu)^2}{n_1 + n_2}}

where the z_i are the union of the points from both groups.
However, since we are not given the initial data points from the two groups, we have to rearrange this formula using the given variances of these groups, s_1^2 and s_2^2, as follows:

s = \sqrt{\frac{n_1 s_1^2 + n_2 s_2^2 + n_1(\mu_1 - \mu)^2 + n_2(\mu_2 - \mu)^2}{n_1 + n_2}}

Applying the Bessel correction, the blended standard deviation for the two groups is as follows:

s = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2 + n_1(\mu_1 - \mu)^2 + n_2(\mu_2 - \mu)^2}{n_1 + n_2 - 1}}

To extend the definition above to K subsets, the mean is as follows:

\mu = \frac{\sum_{i=1}^{K} n_i\mu_i}{\sum_{i=1}^{K} n_i}

and the blended standard deviation extends analogously, with the sums in the numerator and denominator taken over all K groups.
Solution #6.37
Independence is defined as follows: P(X = x, Y = y) = P(X = x)P(Y = y) for all x, y. Equivalently, we can use the following definitions: P(X = x | Y = y) = P(X = x) and P(Y = y | X = x) = P(Y = y).
When two random variables X and Y are uncorrelated, their covariance, which is calculated as follows, is 0: Cov(X, Y) = E[XY] - E[X]E[Y]
For an example of uncorrelated but not independent variables, let X take on values -1, 0, or 1 with equal probability, and let Y = 1 if X = 0 and Y = 0 otherwise. Then we can verify that X and Y are uncorrelated:

E[XY] = \frac{1}{3}(-1)(0) + \frac{1}{3}(0)(1) + \frac{1}{3}(1)(0) = 0

And E[X] = 0, so the covariance between the two random variables is zero. However, it is clear that the two are not independent, since we defined Y in such a way that it obviously depends on X.
For example, P(Y = 1 | X = 0) = 1, whereas P(Y = 1) = 1/3, so the conditional and marginal probabilities differ.
Solution #6.38
By definition of the covariance, we have: Cov(X, Y) = Cov(X, X^2) = E[(X - E[X])(X^2 - E[X^2])]
Expanding the terms of the equation above yields:

Cov(X, Y) = E[X^3 - X E[X^2] - X^2 E[X] + E[X]E[X^2]]

Using linearity of expectation, we obtain:

Cov(X, Y) = E[X^3] - E[X]E[X^2] - E[X^2]E[X] + E[X]E[X^2]

Since the second and last terms cancel one another, we end up with the following:

Cov(X, Y) = E[X^3] - E[X^2]E[X]

Here, we conclude that E[X] = 0 (based on the definition of X) and that E[X^3] = 0 by evaluating the probability density function of X as follows:

f_X(x) = \frac{1}{b-a} = \frac{1}{1-(-1)} = \frac{1}{2}

Since we are evaluating X from -1 to 1, we then have:

E[X^3] = \int_{-1}^{1} x^3 f(x)\,dx = \int_{-1}^{1} \frac{x^3}{2}\,dx = 0

Therefore, Cov(X, Y) = 0, so X and Y = X^2 are uncorrelated even though Y is completely determined by X.
Solution #6.39
This can be proved using the inverse-transform method, whereby we sample from a uniform distribution and then simulate the points in the circle employing the inverse cumulative distribution functions (i.e., inverse CDFs).
We can define a random point within the circle using a radius value and an angle (and obtain the corresponding x, y values from polar coordinates). To sample a random radius, consider the following. If we sample points at a radius r, there are 2πr points to consider (i.e., the circumference of that circle). Likewise, if we sample at a radius 2r, there are 4πr points to consider. Therefore, the probability density function of the radius is given by the following:

f_R(r) = \frac{2r}{R^2}

This follows from the CDF, which is given by the ratio of the areas of the two circles:

F_R(r) = \frac{r^2}{R^2}

Therefore, for the inverse sampling, we set y = \frac{r^2}{R^2} and solve for r, which simplifies to r = R\sqrt{y}.
Therefore, we can sample Y ~ U(0, 1) and the corresponding radius will be r = R\sqrt{y}.
For the corresponding angles, we can sample θ uniformly from the range 0 to 2π, i.e., θ ∈ [0, 2π], and then set x = r cos(θ), y = r sin(θ).
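As an added sketch, the sampling procedure above translates directly into a few lines of Python (R = 1 is an arbitrary assumption):

# Sketch: sample points uniformly within a circle of radius R
# using the inverse-transform method.
import math
import random

def sample_point_in_circle(R: float = 1.0) -> tuple[float, float]:
    r = R * math.sqrt(random.random())      # radius via the inverse CDF
    theta = random.uniform(0, 2 * math.pi)  # angle sampled uniformly
    return r * math.cos(theta), r * math.sin(theta)

points = [sample_point_in_circle() for _ in range(5)]
print(points)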
Solution #6.40
Let us define N_t as the smallest n such that \sum_{i=1}^{n} U_i > t, for any value t between 0 and 1. Then we want to find m(t) = E[N_t].
Consider the first draw. Assuming that its result is some value x, we then have two cases. The first is that x > t, in which case N_t = 1.
The second is that x < t, necessitating that we sample again, yielding N_t = 1 + N_{t-x}.
Putting these two together, we have:

m(t) = 1 + \int_{0}^{t} m(t - x)\,dx

Differentiating both sides with respect to t gives m'(t) = m(t), and since m(0) = 1, we obtain m(t) = e^t; in particular, m(1) = e ≈ 2.718.
How much machine learning do you actually need to know to land a top job in Silicon Valley
or Wall Street? Probably less than you think! From coaching hundreds of data folks on the
job hunt, one of the most common misconceptions we saw was candidates thinking their
lack of deep learning expertise would tank their performance in data science interviews.
However, the truth is that most data scientists are hired to solve business problems — not
blindly throw complicated neural networks on top of dirty data. As such, a data scientist with
strong business intuition can create more business value by applying linear regression in an
Excel sheet than a script kiddie whose knowledge doesn’t extend beyond the Keras API.
So unless you're interviewing for ML Engineering or research scientist roles, a solid
understanding of the classical machine learning techniques covered in this chapter is all you
need to ace the data science interview. However, if you are aiming for ML-heavy roles that
do require advanced knowledge, this chapter will still be handy! Throughout this chapter, we
frequently call attention to which topics and types of questions show up in tougher ML
interviews. Plus, the 35 questions at the end of the chapter — especially the hard ones —
will challenge even the most seasoned ML practitioner.
to conceptual knowledge). As such, if you have job experience that is directly relevant, interviewers
will often ask about that. If not, they'll often fall back to asking about your projects.
While anything listed on your resume is fair game to be picked apart, this is especially true for more
ML-heavy roles. Because the field is so vast and continually evolving, an interviewer isn't able to
assess your fit for the job by asking about some niche topic unrelated to the position at hand. For
example, say you are going for a general data science role: it's not fair to ask a candidate about CNNs
and their use in computer vision if they have no experience with this topic and it's not relevant to
the job. But, suppose you hacked together a self-driving toy car last summer, and listed it on your
resume. In that case — even though the role at hand may not require computer vision — it's totally
fair game to be asked more about the neural network architecture you used, model training issues
you faced, and trade-offs you made versus other techniques. Plus, in an effort to see if you know the
details not just of your project, but of the greater landscape, you'd also be expected to answer
questions tangentially related to the project.
Linear Algebra
The main linear algebra subtopic worth touching on for interviews is eigenvalues and eigenvectors.
Mechanically, for some n × n matrix A, x is an eigenvector of A if Ax = λx, where λ is a scalar known as the eigenvalue. A matrix can represent a linear transformation; when that transformation is applied to an eigenvector x, the result is another vector with the same direction as x, namely x multiplied by the scaling factor λ.
The decomposition of a square matrix into its eigenvectors is called an eigendecomposition.
However, not all matrices are square. Non-square matrices are decomposed using a method called
singular value decomposition (SVD). A matrix to which SVD is applied has a decomposition of the form A = U\Sigma V^T, where U is an m \times m matrix, \Sigma is an m \times n matrix, and V is an n \times n matrix.
There are many applications of linear algebra in ML, ranging from the matrix multiplications during
backpropagation in neural networks, to using eigendecomposition of a covariance matrix in PCA. As
such, during technical interviews for ML engineering and quantitative finance roles, you should be
able to whiteboard any follow-up questions on the linear algebra concepts underlying techniques
like PCA and linear regression. Other linear algebra topics you're expected to know are core building
blocks like vector spaces, projections, inverses, matrix transformations, determinants,
orthonormality, and diagonalization.
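As an added sketch (the matrices below are arbitrary examples), NumPy exposes both decompositions directly:

# Sketch: eigendecomposition of a square symmetric matrix and
# SVD of a non-square matrix.
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)   # e.g., [3., 1.]
print(eigenvectors)  # columns are the corresponding eigenvectors

B = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 0.0]])  # 2 x 3, so we use SVD instead
U, S, Vt = np.linalg.svd(B)
print(U.shape, S, Vt.shape)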
Gradient Descent
Machine learning is concerned with minimizing some particular objective function (most commonly
known as a loss or cost function). A loss function measures how well a particular model fits a given
dataset, and the lower the cost, the more desirable. Techniques to optimize the loss function are
known as optimization methods.
One popular optimization method is gradient descent, which takes small steps in the direction of
steepest descent for a particular objective function. It's akin to racing down a hill. To win, you always
take a “next step” in the steepest direction downhill.
[Figure: a convex cost curve over the model weight, with gradient steps descending toward the minimum cost]
For convex functions, the gradient descent algorithm eventually finds the optimal point by updating
the below equation until the value at the next iteration is very close to the current iteration
(convergence):
x_{i+1} = x_i - \eta_i \nabla f(x_i)

That is, the algorithm calculates the negative of the gradient of the cost function, scales it by some constant \eta_i (known as the learning rate), and then moves in that direction at each iteration.
Since many cost functions in machine learning can be broken down into the sum of individual
functions, the gradient step can be broken down into adding separate gradients. However, this
process can be computationally expensive, and the algorithm may get stuck at a local minimum or
saddle point. Therefore, we can use a version of gradient descent called stochastic gradient descent (SGD), which adds an element of randomness so that the optimization does not get stuck. SGD uses a single randomly chosen data point (or a much smaller subset of data points) at any given step, but is nonetheless able to obtain an unbiased estimate of the true gradient. Alternatively, we can use mini-batch gradient descent, which uses a fixed, small number of data points (a mini-batch) per step.
Gradient descent and SGD are popular topics for ML interviews since they are used to optimize the
training of almost all machine learning methods. Besides the usual questions on the high-level
concepts and mathematical details, you may be asked when you would want to use one or the other.
You might even be asked to implement a basic version of SGD in a coding interview (which we cover
in Chapter 9, problem #30).
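To make the update rule concrete, here is a minimal added sketch of plain gradient descent on a toy convex function; the function, starting point, and learning rate are assumptions chosen for illustration (Chapter 9, problem #30 covers coding SGD itself):

# Sketch: gradient descent on f(x) = (x - 3)^2 with a fixed learning rate.
def gradient(x: float) -> float:
    return 2 * (x - 3)  # derivative of (x - 3)^2

x = 0.0             # starting point
learning_rate = 0.1
for _ in range(100):
    step = learning_rate * gradient(x)
    if abs(step) < 1e-8:  # convergence check
        break
    x -= step             # move against the gradient

print(x)  # ~3.0, the minimizer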
Bias-Variance Trade-off
The bias-variance trade-off is an interview classic, and is a key framework for understanding different
kinds of models. With any model, we are usually trying to estimate a function f(x), which predicts our
target variable y based on our input x. This relationship can be described as follows:
y=f ( x )+ w
where w is noise, not captured by f(x), and is assumed to be distributed as a zero-mean Gaussian
random variable for certain regression problems. To assess how well the model fits, we can
decompose the error of y into the following:
1. Bias: how close the model's predicted values come to the true underlying f(x) values, with
smaller being better
2. Variance: the extent to which model prediction error changes based on training inputs, with
smaller being better
The trade-off between bias and variance provides a lens through which you can analyze different
models. Say we want to predict housing prices given a large set of potential predictors (square
footage of a house, the number of bathrooms, and so on). A model with high bias but low variance,
such as linear regression, is easy to implement but may oversimplify the situation at hand. This high
bias but low variance situation would mean that predicted house prices are frequently off from the
market value, but the variance in these predicted prices is low. On the flip side, a model with low
bias and high variance, such as neural networks, would lead to predicted house prices closer to
market value, but with predictions varying wildly based on the input features.
[Figure: illustration of low-bias and high-bias models]
While the bias-variance trade-off equation occasionally shows up in data science interviews, more
frequently, you'll be asked to reason about the bias-variance trade-off given a specific situation. For
example, presented with a model that has high variance, you could mention how you'd source
additional data to fix the issue. Posed with a situation where the model has high bias, you could
discuss how increasing the complexity of the model could help. By understanding the business and
product requirements, you'll know how to make the bias-variance trade-off for the interview
problem posed.
Underfitting refers to the opposite case — the scenario where the model is not learning enough of the true relationship underlying the data. Because overfitting is so common in real-world machine learning, interviewers commonly ask how you can detect it and what you can do to avoid it,
which brings us to our next topic: regularization.
Regularization
Regularization aims to reduce the complexity of models. In relation to the bias-variance trade-off,
regularization aims to decrease complexity in a way that significantly reduces variance while only
slightly increasing bias. The most widely used forms of regularization are L1 and L2. Both methods
add a simple penalty term to the objective function. The penalty helps shrink the coefficients of features, which reduces overfitting. This is why, not surprisingly, they are also known as shrinkage methods.
Specifically, L1, also known as lasso, adds the absolute value of each coefficient to the objective function as a penalty. On the other hand, L2, also known as ridge, adds the squared magnitude of each coefficient to the objective function. The L1 and L2 penalties can also be linearly combined, resulting in the popular form of regularization called elastic net. Since having models overfit is a prevalent problem
in machine learning, it's important to understand when to use each type of regularization. For
example, L1 serves as a feature selection method, since many coefficients shrink to 0 (are zeroed
out), and hence, are removed from the model. L2 is less likely to shrink any coefficients to 0.
Therefore, L1 regularization leads to sparser models, and is thus considered a more strict shrinkage
operation.
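As an added sketch of this difference, the snippet below fits lasso and ridge on a synthetic dataset; the dataset and alpha values are arbitrary assumptions:

# Sketch: comparing L1 (lasso) and L2 (ridge) coefficient shrinkage.
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# Lasso tends to zero out uninformative features; ridge only shrinks them.
print("Lasso zeroed coefficients:", (lasso.coef_ == 0).sum())
print("Ridge zeroed coefficients:", (ridge.coef_ == 0).sum())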
Interpretability & Explainability
In Kaggle competitions and classwork you might be expected to maximize a model performance
metric like accuracy. However, in the real world, rather than just maximizing a particular metric, you
might also be responsible for explaining how your model came up with that output. For example, if
your model predicts that someone shouldn't get a loan, doesn't that person deserve to know why?
More broadly, interpretable models can help you identify biases in the model, which leads to more
ethical AI. Plus, in some fields, like healthcare, there can be deep auditing of decisions, and explainable models can help you stay compliant. However, there's usually a trade-off between performance and model interpretability. Often, using a more complex model might increase performance, but make results harder to interpret.
Various models have their own way of interpreting feature importance. For example, linear models have weights which can be visualized and analyzed to interpret the decision making. Similarly, random forests have feature importance readily available to identify what the model is using and learning. There are also some general frameworks that can help with more "black-box" models. One is SHAP (SHapley Additive exPlanations), which uses Shapley values to denote the average marginal
contribution of a feature over all possible combinations of inputs. Another technique is LIME (Local
Interpretable Model-agnostic Explanations), which uses sparse linear models built around various
predictions to understand how any model performs in that local vicinity.
While it's rare to be asked about the details of SHAP and LIME during interviews, having a basic
understanding of why model interpretability matters, and bringing up this consideration in more
open-ended problems is key.
Model Training
We've covered frameworks to evaluate models, and selected the best-performing ones, but how do
we actually train the model in the first place? If you don't master the art of model training (aka
teaching machines to learn), even the best machine learning techniques will fail. Recall the basics:
we first train models on a training dataset and then test the models on a testing dataset. Normally,
80% of the data will go towards training data, and 20% serves as the test set. But as we soon cover,
there's much more to model training than the 80/20 train vs. test split.
Cross-Validation
Cross-validation assesses the performance of an algorithm in several subsamples of training data. It
consists of running the algorithm on subsamples of the training data, such as the original data
without some of the original observations, and evaluating model performance on the portion of the
data that was excluded from the subsample. This process is repeated many times for the different
subsamples, and the results are combined at the end.
Cross-validation helps you avoid training and testing on the same subsets of data points, which
would lead to overfitting. As mentioned earlier, in cases where there isn't enough data or getting
more data is costly, cross-validation enables you to have more faith in the quality and consistency of
a model's test performance. Because of this, questions about how cross-validation works and when
to use it are routinely asked in data science interviews.
One popular way to do cross-validation is called k-fold cross-validation. The process is as follows:
1. Randomly shuffle the data and split it into k equally sized blocks (folds).
2. For each fold i, train the model on all the data except fold i, and evaluate the validation error
using block i.
3. Average the k validation errors from step 2 to get an estimate of the true error.
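As an added sketch of the procedure above, scikit-learn wraps these steps in a single call; the model and synthetic dataset are placeholders:

# Sketch: 5-fold cross-validation with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=5)  # 5 folds
print(scores, scores.mean())  # per-fold accuracy and its average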
Another form of cross-validation you're expected to know for the interview is leave-one-out cross-
validation. LOOCV is a special case of k-fold cross-validation where k is equal to the size of the
dataset (n). That is, the model is evaluated on every single data point during the cross-validation.
In the case of larger datasets, cross-validation can become computationally expensive, because every fold is used for evaluation. In this case, it can be better to use a train-validation split, where you
split the data into three parts: a training set, a dedicated validation set (also known as a "dev" set),
and a test set. The validation set usually ranges from 10%-20% of the entire dataset.
An interview question that comes up from time to time is how to apply cross-validation for
time-series data. Standard k-fold CV can't be applied, since time-series data is not randomly
distributed but instead is already in chronological order. Therefore, you should not be using data "in
the future" for predicting data "from the past." Instead, you should use historical data up until a
given point in time, and vary that point in time from the beginning till the end.
For real-life training ML models, you should also factor in training time considerations and resource
constraints during model selection. While you can always train more complex models that might
achieve marginally higher model performance metrics, the trade-off versus increased resource usage
and training time might make such a decision suboptimal.
Learning curves are plots of model learning performance over time. The y-axis is some metric of
learning (for example, classification accuracy), and the x-axis is experience (time).
[Figure: learning curves plotting training and validation accuracy against training experience]
Linear Regression
Linear regression is a form of supervised learning, where a model is trained on labeled input data.
Linear regression is one of the most popular methods employed in machine learning and has many
real-life applications due to its quick runtime and interpretability. That's why there's the joke about
regression to regression: where you try to solve a problem with more advanced methods but end up
falling back to tried and true linear regression.
As such, linear regression questions are asked in all types of data science and machine learning
interviews. Essentially, interviewers are trying to make sure your knowledge goes beyond importing linear regression from scikit-learn and then blindly calling LinearRegression().fit(X, y). That's why deep
knowledge of linear regression — understanding its assumptions, addressing edge cases that come
up in real-life scenarios, and knowing the different evaluation metrics — will set you apart from
other candidates.
In linear regression, the goal is to estimate a function f(x), such that each feature has a linear
relationship to the target variable y, or:
y = Xβ
where X is a matrix of predictor variables and β is a vector of parameters that determines the weight of each variable in predicting the target variable. So, how do you compare the performance of two
linear regression models?
A common interview question is "What's the expected impact on R² when adding more features to a model?" While adding more features to a model always increases R², that doesn't necessarily make for a better model. Since any machine learning model can overfit by having more parameters, a goodness-of-fit measure like R² should also be assessed with model complexity in mind. Metrics that take into account the number of features in linear regression models include AIC, BIC, Mallows' Cp, and adjusted R².
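An added sketch of this effect with statsmodels, using synthetic data and a deliberately irrelevant noise feature:

# Sketch: R^2 vs. adjusted R^2 before and after adding a useless feature.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 2 * x + rng.normal(size=n)
noise_feature = rng.normal(size=n)  # unrelated to y

X1 = sm.add_constant(x)
X2 = sm.add_constant(np.column_stack([x, noise_feature]))

fit1 = sm.OLS(y, X1).fit()
fit2 = sm.OLS(y, X2).fit()

# R^2 never decreases when a feature is added; adjusted R^2 can.
print(fit1.rsquared, fit1.rsquared_adj)
print(fit2.rsquared, fit2.rsquared_adj)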
Subset Selection
So, how do you reduce model complexity of a regression model? Subset selection. By default, we use
all the predictors in a linear model. However, in practice, it's important to narrow down the number
of features, and only include the most important features. One way is best subset selection, which
tries every model with k predictors, out of p possible ones, for each k ≤ p. Then, you choose the best subset model using a regression metric like R². While this guarantees the best result, it can be
computationally infeasible as p increases (due to the exponential number of combinations to try).
Additionally, by trying every option in a large search space, you're likely to get a model that overfits
with a high variance in coefficient estimates.
Therefore, an alternative is to use stepwise selection. In forward stepwise selection, we start with an
empty model and iteratively add the most useful predictor. In backward stepwise selection, we start
with the full model and iteratively remove the least useful predictor. While doing stepwise selection,
we aim to find a model with high R² and low RSS, while considering the number of predictors using metrics like AIC or adjusted R².
Note: for the independence and normality assumption, use of the term "i.i.d." (independent and
identically distributed) is also common. If any of these assumptions are violated, any forecasts or
confidence intervals based on the model will most likely be misleading or biased. As a result, the
linear regression model will likely perform poorly out of sample.
Another useful diagnostic plot is the scale-location plot, which plots standardized residuals versus
the fitted values. If the data shows heteroscedasticity, then you will not see a horizontal line with
equally spread points.
[Figure: scale-location plots for two example cases]
Normality
Linear regression assumes the residuals are normally distributed. We can test this through a QQ plot.
Also known as a quantile plot, a QQ plot graphs the standardized residuals versus theoretical
quantiles and shows whether the residuals appear to be normally distributed (i.e., the plot
resembles a straight line). If the QQ plot is not a reasonably straight line, this is a sign that the
residuals are not normally distributed, and hence, the model should be reexamined. In that case,
transforming the dependent variable (with a log or square-root transformation, for example) can
help reduce skew.
Outliers
Outliers can have an outsized impact on regression results. There are several ways to identify
outliers. One of the more popular methods is examining Cook's distance, which is an estimate of the
influence of any given data point. Cook's distance takes into account the residual and leverage (how
far away the X value differs from that of other observations) of every point. In practice, it can be
useful to remove points with a Cook's distance value above a certain threshold.
Multicollinearity
Another pitfall is if the predictors are correlated. This phenomenon, known as multicollinearity,
affects the resulting coefficient estimates by making it problematic to distinguish the true underlying
individual weights of variables. Multicollinearity is most commonly observed through coefficient estimates that flip sign or swing in magnitude when predictors are added or removed. It is one of the reasons why model weights cannot be directly interpreted as the importance of a feature in linear regression. Features that initially would appear to be independent variables can often be highly correlated: for example, the number of Instagram posts made and the number of notifications received are most likely highly correlated, since both are related to user activity on the platform, and one generally causes the other.
One way to assess multicollinearity is by examining the variance inflation factor (VIF), which
quantifies how much the estimated coefficients are inflated when multicollinearity exists. Methods
to address multicollinearity include removing the correlated variables, linearly combining the
variables, or using PCA/PLS (partial least squares).
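As an added sketch, VIFs can be computed with statsmodels; the design matrix below is synthetic, with two deliberately correlated features:

# Sketch: variance inflation factors on a synthetic design matrix.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)  # highly correlated with x1
x3 = rng.normal(size=n)                  # independent feature

X = sm.add_constant(np.column_stack([x1, x2, x3]))
for i in range(1, X.shape[1]):           # skip the constant column
    print(f"VIF for feature {i}: {variance_inflation_factor(X, i):.1f}")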
Confounding Variables
Multicollinearity is an extreme case of confounding, which occurs when a variable (but not the main
independent or dependent variables) affects the relationship between the independent and
dependent variables. This can cause invalid correlations. For example, say you were studying the
effects of ice cream consumption on sunburns and find that higher ice cream consumption leads to a
higher likelihood of sunburn. That would be an incorrect conclusion because temperature is the
confounding variable — higher summer temperatures lead to people eating more ice cream and also
spending more time outdoors (which leads to more sunburn).
Confounding can occur in many other ways, too. For example, one way is selection bias, where the
data are biased due to the way they were collected (for example, group imbalance). Another
problem, known as omitted variable bias, occurs when important variables are omitted, resulting in
a linear regression model that is biased and inconsistent. Omitted variables can stem from dataset
generation issues or choices made during modeling. A common way to handle confounding is
stratification, a process where you create multiple categories or subgroups in which the confounding
variables do not vary much, and then test the significance and strength of associations using chi-square tests.
Knowing about these regression edge cases, how to identify them, and how to guard against them is
crucial. This knowledge separates the seasoned data scientists from the data neophyte — precisely
why it's such a popular topic for data science interviews.
Random Component: the distribution of the error term, e.g., the normal distribution for linear regression.
Systematic Component: the explanatory variables, i.e., the predictors combined in a linear combination.
Link Function: the link between the random and systematic components, e.g., the identity link for linear regression or the logit link for logistic regression.
Note that in GLMs, the model is still built on a linear combination of weights and predictors.
Regression can also use the weights and predictors nonlinearly; the most common examples of this
are polynomial regressions, splines, and general additive models. While interesting, these
techniques are rarely asked about in interviews and thus are beyond the scope of this book.
Classification
General Framework
Interview questions related to classification algorithms are commonly asked during interviews due to
the abundance of real-life applications for assigning categories to things. For example, classifying
users as likely to churn or not, predicting whether a person will click on an ad or not, and
distinguishing fraudulent transactions from legitimate ones are all applications of the classification
techniques we mention in this section.
The goal of classification is to assign a given data point to one of K possible classes instead of
calculating a continuous value (as in regression). The two types of classification models are
generative models and discriminative models. Generative models deal with the joint distribution of X
and Y, which is defined as follows:

p(X, Y) = p(Y|X)p(X)

Maximizing the posterior probability distribution produces decision boundaries between classes where the resulting posterior probability is equivalent. The second type of model is discriminative. It directly determines a decision boundary by choosing the class that maximizes the posterior probability:

\hat{y} = \arg\max_{k}\; p(Y = k \mid x)
Thus, both methods choose a predicted class that maximizes the posterior probability distribution;
the difference is simply the approach. While traditional classification deals with just two classes (0 or
1), multi-class classification is common, and many of the below methods can be adapted to handle
multiple labels.
Evaluating Classifiers
Before we detail the various classification algorithms like logistic regression and Naive Bayes it's
essential to understand how to evaluate the predictive power of a classification model.
Say you are trying to predict whether an individual has a rare cancer that only happens to 1 in
10,000 people. By default, you could simply predict that every person doesn't have cancer and be
accurate 99.99% of the time. But clearly, this isn't a helpful model — Pfizer won't be acquiring our
diagnostic test anytime soon! Given imbalanced classes, assessing accuracy alone is not enough —
this is known as the "accuracy paradox" and is the reason why it's critical to look at other measures
for misclassified observations. A false positive occurs when the model incorrectly predicts that an instance belongs to the positive class. For the cancer detection example,
a false positive would be classifying an individual as having cancer, when in reality, the person does
not have it. On the other hand, a false negative occurs when the model incorrectly produces a
negative class. In the cancer diagnostic case, this would mean saying a person doesn't have cancer,
when in fact they do.
A confusion matrix helps organize and visualize this information. Each row represents the actual
number of observations in a class, and each column represents the number of observations
predicted as belonging to a class.
                   Predicted Positive                   Predicted Negative
Actual Positive    True Positive (TP)                   False Negative (FN), Type 2 Error   | Sensitivity
Actual Negative    False Positive (FP), Type 1 Error    True Negative (TN)                  | Specificity
                   Precision                            Negative Predictive Value           | Accuracy
The larger the area under the ROC curve, the better the model performs in separating the classes. The most optimal is a curve that "hugs" the top
left of the plot, as shown below. This indicates that a model has a high true-positive rate and
relatively low false-positive rate.
[Figure: ROC curve]
Logistic Regression
One of the most popular classification algorithms is logistic regression, and it is asked about almost
as frequently as linear regression during interviews. In logistic regression, a linear output is
converted into a probability between 0 and 1 using the sigmoid function:

S(x) = \frac{1}{1 + e^{-x\beta}}

In the equation above, x is the set of predictor features and β is the corresponding vector of weights. Computing S(x) produces a probability that indicates whether an observation should be classified as a "1" (if the calculated probability is at least 0.5) or a "0" otherwise:

P(\hat{Y} = 1 \mid X) = S(X\beta)
[Figure: linear regression fit vs. logistic regression (sigmoid) fit]
The loss function for logistic regression, also known as log-loss, penalizes confident wrong predictions heavily; for n observations with labels y_i and predicted probabilities p_i, it is formulated as:

-\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log(p_i) + (1 - y_i)\log(1 - p_i)\right]
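As an added sketch, the snippet below shows the sigmoid transformation and a scikit-learn logistic regression fit on synthetic data (all values are illustrative assumptions):

# Sketch: sigmoid function and a logistic regression fit.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    return 1 / (1 + np.exp(-z))  # maps any real value into (0, 1)

print(sigmoid(np.array([-2.0, 0.0, 2.0])))  # ~[0.12, 0.5, 0.88]

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
clf = LogisticRegression().fit(X, y)
print(clf.predict_proba(X[:3]))  # class probabilities for three rows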
Naive Bayes
Naive Bayes classifiers require only a small amount of training data to estimate the necessary
parameters. They can be extremely fast compared to more sophisticated methods (such as support
vector machines). These advantages lead to Naive Bayes being a popularly used first technique in
modeling, and is why this type of classifier shows up in interviews.
Naive Bayes uses Bayes' rule (covered in Chapter 6: Statistics) and a set of conditional independence assumptions in order to learn P(Y|X). There are two assumptions to know about Naive Bayes:
1. It assumes each X_i is independent of any other X_j given Y, for any pair of features X_i and X_j.
2. It assumes each feature is given the same weight.
The decoupling of the class conditional feature distributions means that each distribution can be independently estimated as a one-dimensional distribution. That is, we have the following:

P(X_1, \ldots, X_n \mid Y) = \prod_{i=1}^{n} P(X_i \mid Y)

Using the conditional independence assumption, and then applying Bayes' theorem, the classification rule becomes:

\hat{y} = \arg\max_{y}\; P(y)\prod_{i=1}^{n} P(x_i \mid y)
To understand the beauty of Naive Bayes, recall that for any ML model having k features, there are on the order of 2^k possible feature interactions (the correlations between them all). Due to this large number of feature interactions, you would typically need on the order of 2^k data points for a high-performing model. However, due to the conditional independence assumption in Naive Bayes, only on the order of k data points are needed, which sidesteps this problem.
For text classification (e.g., classifying spam, sentiment analysis), this assumption is convenient since
there are many predictors (words) that are generally independent of one another.
While the assumptions simplify calculations and make Naive Bayes highly scalable to run, they are
often not valid. In fact, the first conditional independence assumption generally never holds true,
since features do tend to be correlated. Nevertheless, this technique performs well in practice since
most data is linearly separable.
SVMs
The goal of an SVM is to form a hyperplane that linearly separates the training data. Specifically, it aims
to maximize the margin, which is the minimum distance from the decision boundary to any training
point. The points closest to the hyperplane are called the support vectors. Note that the decision
boundaries for SVMs can be nonlinear, which is unlike that of logistic regression, for example.
In the image above, it's easy to visualize how a line can be found that separates the points correctly
into their two classes. In practice, splitting the points isn't that straightforward. Thus, SVMs rely on a
kernel to transform data into a higher-dimensional space, where it then finds the hyperplane that
best separates the points. The image below visualizes this kernel transformation:
Decision Trees
Decision trees and random forests are commonly discussed during interviews since they are flexible
and often perform well in practice for both classification and regression use cases. Since both use
cases are possible, decision trees are also known as CART (classification and regression trees). For
this section, we'll focus on the classification use case for decision trees. While reading this section,
keep in mind that for interviews, it helps to understand how both decision trees and random forests
are trained. Related topics of entropy and information gain are also crucial to review before a data
science interview.
Training
A decision tree is a model that can be represented in a treelike form determined by binary splits
made in the feature space and resulting in various leaf nodes, each with a different prediction. Trees
are trained in a greedy and recursive fashion, starting at a root node and subsequently proceeding
through a series of binary splits in features (i.e., variables) that lead to minimal error in the
classification of observations.
[Figure: decision tree predicting survival of passengers on the Titanic, with splits on gender, age, and number of siblings/spouses aboard (sibsp)]
Entropy
The entropy of a random variable Y quantifies the uncertainty in Y. For a discrete variable Y (assuming
k states), it is stated as follows:

H(Y) = −∑_{i=1}^{k} P(Y = y_i) log P(Y = y_i)
For example, for a simple Bernoulli random variable, this quantity is highest when p = 0.5 and lowest
when p = 0 or p = 1, a behavior that aligns intuitively with its definition since if p = 0 or 1, then there
is no uncertainty with respect to the result. Generally, if a random variable has high entropy, its
distribution is closer to uniform than a skewed one. There are many measures of entropy — in
practice, the Gini index is commonly used for decision trees.
In the context of decision trees, consider an arbitrary split. We have H(Y) from the initial training
labels and assume that we have some feature X on which we want to split. We can characterize the
reduction in uncertainty given by the feature X, known as information gain, which can be formulated
as follows:
IG(Y, X) = H(Y) − H(Y | X)
The larger IG(Y, X) is, the higher the reduction in uncertainty in Y by splitting on X. Therefore, the
general process assesses all features in consideration and chooses the feature that maximizes this
information gain, then recursively repeats the process on the two resulting branches.
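To make the formulas concrete, here is a minimal sketch (not from the original text) that computes entropy and information gain for a toy binary split; the counts are made up.

```python
# Minimal sketch: computing entropy and information gain for a binary split (toy counts).
import numpy as np

def entropy(counts):
    """Entropy of a label distribution given class counts."""
    p = np.array(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]                       # treat 0 * log(0) as 0
    return -(p * np.log2(p)).sum()

# Parent node: 5 positives, 5 negatives
h_parent = entropy([5, 5])

# Split on a feature X: left branch holds counts [5, 1], right branch holds [0, 4]
h_left, h_right = entropy([5, 1]), entropy([0, 4])
h_children = (6 / 10) * h_left + (4 / 10) * h_right

print("information gain:", h_parent - h_children)   # roughly 0.61 bits
```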
Random Forests
Typically, an individual decision tree may be prone to overfilling because a leaf node can be created
for each observation. In practice, random forests yield better out-ofsample predictions than decision
trees. A random forest is an ensemble method that can utilize many decision trees, whose decisions
it averages.
Two characteristics of random forests allow a reduction in overfitting and the correlation between
the trees. The first is bagging, where individual decision trees are fitted following cach bootstrap
sample and then averaged afterwards. Bagging significantly reduces the variance of the random
forest versus the variance of any individual decision trees. The second way random forests reduce
overfitting is that a random subset of features is considered at each split, preventing the important
features from always being present at the tops or individual trees.
Random forests are often used due to their versatility, interpretability (you can quickly see feature
importance), quick training times (they can be trained in parallel), and prediction performance. In
interviews, you'll be asked about how they work versus a decision tree, and when you would use a
random forest over other techniques.
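As an illustrative sketch (not from the original text), assuming scikit-learn and one of its built-in toy datasets, the following trains a random forest and reads off feature importances; the hyperparameter values are arbitrary.

```python
# Minimal sketch: training a random forest and inspecting feature importances (toy dataset).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0)

# Each tree is trained on a bootstrap sample; each split considers a random subset of features
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt", n_jobs=-1, random_state=0)
forest.fit(X_train, y_train)

print("test accuracy:", forest.score(X_test, y_test))
for name, imp in sorted(zip(data.feature_names, forest.feature_importances_),
                        key=lambda t: -t[1])[:5]:
    print(name, round(imp, 3))
```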
Boosting
Boosting is a type of ensemble model that trains a sequence of "weak" models (such as small
decision trees), where each one sequentially compensates for the weaknesses of the preceding
models. Such weaknesses can be measured by the current model's error rate, and the relative error
rates can be used to weigh which observations the next models should focus on. Each training point
within a dataset is assigned a particular weight and is continually re-weighted in an iterative fashion
such that points that are mispredicted take on higher weights in each iteration. In this way, more
emphasis is placed on points that are harder to predict. This can lead to overfitting if the data is
especially noisy.
One example is AdaBoost (adaptive boosting), which is a popular technique used to train a model
based on tuning a variety of weak learners. That is, it sequentially combines decision trees with a
single split, where weights are initially set uniformly for all data points. At each iteration, data points are
re-weighted according to whether each was classified correctly or incorrectly by the classifier. At the end,
the weighted predictions of each classifier are combined to obtain a final prediction.
The generalized form of AdaBoost is called gradient boosting. A well-known form of gradient
boosting used in practice is called XGBoost (extreme gradient boosting). Gradient boosting is similar
to AdaBoost, except that shortcomings of previous models are identified by the gradient rather than
high weight points, and all classifiers have equal weights instead of having different weights. In
industry, XGBoost is used heavily due to its execution speed and model performance.
Since random forests and boosting are both ensemble methods, interviewers tend to ask questions
comparing and contrasting the two. For example, one of the most common interview questions is
"What is the difference between XGBoost and a random forest?"
Dimensionality Reduction
Imagine you have a dataset with one million rows but two million features, most of which are null
across the data points. You can intuitively guess that it would be hard to tease out which features are
predictive for the task at hand. In geometric terms, this situation demonstrates sparse data spread
over multiple dimensions, meaning that each data point is relatively far away from other data points.
This lack of closeness is problematic, because when extracting patterns using machine learning, the
idea of similarity or closeness of data often matters a great deal. If a particular data point has
nothing close to it, how can an algorithm make sense of it?
This phenomenon is known as the curse of dimensionality. One way to address this problem is to
increase the dataset size, but often, in practice, it's costly or infeasible to get more training data.
Another way is to conduct feature selection, such as removing multicollinearity, but this can be
challenging with a very large number of features.
Instead, we can use dimensionality reduction, which reduces the complexity of the problem with
minimal loss of important information. Dimensionality reduction enables you to extract useful
information from data that would otherwise be difficult or too expensive to model, since any
algorithm applied to the raw data would have to incorporate so many features. Decomposing the data into a smaller
set of variables is also useful for summarizing and visualizing datasets. For example, dimensionality
reduction methods can be used to project a large dataset into 2D or 3D space for easier visualization.
A widely used dimensionality reduction method is principal component analysis (PCA). PCA proceeds by first finding the component having maximal variance. Then, the
second component found is uncorrelated with the first and has the second-highest variance, and so
on for the other components. The algorithm ends with some number of dimensions k such that
y_1, ..., y_k explain the majority of the variance, with k << p.
The final result is an eigendecomposition of the covariance matrix of X, where the first principal
component is the eigenvector corresponding to the largest eigenvalue and the second principal
component corresponds to the eigenvector with the second largest eigenvalue, and so on. Generally,
the number of components you choose is based on your threshold for the percent of variance your
principal components can explain. Note that while PCA is a linear dimensionality reduction method,
t-distributed stochastic neighbor embedding (t-SNE) is a non-linear, non-deterministic method used
for data visualization.
In interviews, PCA questions often test your knowledge of the assumptions (like that the variables
need to have a linear relationship). Commonly asked about as well are pitfalls of PCA, like how it
struggles with outliers, or how it is sensitive to the units of measurement for the input features (data
should be standardized). For more ML-heavy roles, you may be asked to whiteboard the
eigendecomposition.
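A minimal PCA sketch, assuming scikit-learn (not from the original text); note the standardization step, since PCA is sensitive to the units of the input features.

```python
# Minimal sketch: PCA after standardizing features (toy dataset).
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data

X_std = StandardScaler().fit_transform(X)   # zero mean, unit variance per feature
pca = PCA(n_components=2)                   # keep the top 2 principal components
X_2d = pca.fit_transform(X_std)

# Fraction of total variance each component explains; used to pick the number of components
print(pca.explained_variance_ratio_)
print(X_2d.shape)   # (150, 2): projected into 2D for visualization
```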
Clustering
Clustering is a popular interview topic since it is the most commonly employed unsupervised
machine learning technique. Recall that unsupervised learning means that there is no labeled
training data, i.e., the algorithm is trying to infer structural patterns within the data, without a
prediction task in mind. Clustering is often done to find "hidden" groupings in data, like segmenting
customers into different groups, where the customers in a group have similar characteristics.
Clustering can also be used for data visualization and outlier identification, as in fraud detection, for
instance. The goal of clustering is to partition a dataset into various clusters or groups by looking only
at the data's input features.
Ideally, the clustered groups have two properties:
Points within a given cluster are similar (i.e., high intra-cluster similarity).
Points in different clusters are not similar (i.e., low inter-cluster similarity).
K-Means clustering
A well-known clustering algorithm, k-means clustering is often used because it is easy to interpret
and implement. It proceeds, first, by partitioning a set of data into k distinct clusters and then
arbitrarily selects centroids of each of these clusters. It iteratively updates partitions by first
assigning points to the closest cluster, then updating centroids, and then repeating this process until
convergence. This process essentially minimizes the total intra-cluster (within-cluster) variation across all clusters.
Mathematically, k-means clustering reaches a solution by minimizing a loss function (also known as a
distortion function). In this example, we minimize Euclidean distance, given points x_i and centroid
values μ_j:

L = ∑_{j=1}^{k} ∑_{x_i ∈ S_j} ||x_i − μ_j||²
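A minimal k-means sketch on synthetic data, assuming scikit-learn (not from the original text); the inertia_ attribute corresponds to the loss above.

```python
# Minimal sketch: k-means clustering on toy data.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# n_init restarts with different centroid initializations and keeps the best solution
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print("cluster assignments:", kmeans.labels_[:10])
print("centroids:\n", kmeans.cluster_centers_)
print("within-cluster sum of squares:", kmeans.inertia_)
```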
K-means Alternatives
One alternative to k-means is hierarchical clustering. Hierarchical clustering assigns data points to
their own cluster and merges clusters that are the nearest (based on any variety of distance metrics)
until there is only one cluster left, generally visualized using a dendrogram. In cases where there is
not a specific number of clusters, or you want a more interpretable and informative output,
hierarchical clustering is more useful than k-means.
While quite similar to k-means, density clustering is another distinct technique. The most well-
known implementation of this technique is DBSCAN. Density clustering does not require a number of
clusters as a parameter. Instead, it infers that number, and learns to identify clusters of arbitrary
shapes. Generally, density clustering is more helpful for outlier detection than k-means.
For example, TikTok may be on the lookout for anomalous profiles, and can use GMMs (Gaussian
mixture models, which model the data as a mixture of several Gaussian distributions) to cluster
various accounts based on features (number of likes sent, messages sent, and comments made) and
identify any accounts whose activity metrics don't seem to fall within the typical user activity
distributions.
Compared to k-means, GMMs are more flexible because k-means only takes into account the mean
of a cluster, while GMMs take into account the mean and variance. Therefore, GMMs are particularly
useful in cases with low-dimensional data or where cluster shapes may be arbitrary. While practically
never asked about for data science interviews (compared to k-means), we brought up GMMs for
those seeking more technical ML research and ML engineering positions.
Neural Networks
While the concepts behind neural networks have been around since the 1950s, it's only in the last
15 years that they've grown in popularity, thanks to an explosion of data being created, along with
the rise of cheap cloud computing resources needed to store and process the massive amounts of
newly created data. As mentioned earlier in the chapter, if your resume has any machine learning
projects involving deep learning experience, then the technical details behind neural networks will
be considered fair game by most interviewers. But for a product data science position or a finance
role (where data can be very noisy, so most models are not purely neural networks), don't expect to
be bombarded with tough neural network questions. Knowing the basics of classical ML techniques
should suffice.
When neural nets are brought up during interviews, questions can range anywhere from qualitative
assessments on how deep learning compares to more traditional machine learning models to
mathematical details on gradient descent and backpropagation. On the qualitative side, it helps to
understand all of the components that go into training neural networks, as well as how neural
networks compare to simpler methods.
Perceptron
Neural networks function in a way similar to biological neurons. They take in various inputs (at input
layers), weight these inputs, and then combine the weighted inputs through a linear combination
(much like linear regression). If the combined weighted output is past some threshold set by an
activation function, the output is then sent out to other layers. This base unit is generally referred to
as a perceptron. Perceptrons are combined to form neural networks, which is why they are also
known as multi-layer perceptrons (MLPs).
While the inputs for a neural network are combined via a linear combination, often, the activation
function is nonlinear. Thus, the relationship between the target variable and the predictor features
(variables) frequently ends up also being nonlinear. Therefore, neural networks are most useful
when representing and learning nonlinear functions.
For reference, we include a list of common activation functions below. The scope of when to use
which activation function is outside of this text, but any person interviewing for an ML-intensive role
should know these use cases along with the activation function's formula.
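The original table of activation functions isn't reproduced here; as a minimal stand-in (not from the original text), here are three common ones and their formulas.

```python
# Minimal sketch: three common activation functions and their formulas.
import numpy as np

def sigmoid(x):
    """sigmoid(x) = 1 / (1 + e^(-x)); squashes input to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    """Hyperbolic tangent; squashes input to (-1, 1)."""
    return np.tanh(x)

def relu(x):
    """ReLU(x) = max(0, x); cheap to compute and helps mitigate vanishing gradients."""
    return np.maximum(0.0, x)

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x), tanh(x), relu(x))
```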
In neural networks, the process of receiving inputs and generating an output continues until an
output layer is reached. This is generally done in a forward manner, meaning that layers process
incoming data in a sequential forward way (which is why most neural networks are known as "feed-
forward"). The layers of-neurons that are not the input or output layers are called the hidden layers
(hence the name "deep learning" for neural networks having many of these). Hidden layers allow for
specific transformations of the data within each layer. Each hidden layer can be specialized to
produce a particular output — for example, in a neural network used for navigating roads, one
hidden layer may identify stop signs, and another hidden layer may identify traffic lights. While those
hidden layers are not enough to independently navigate roads, they can function together within a
larger neural network to drive better than Nick at age 16.
[Figure: a feed-forward neural network, with input data passing through hidden layers (Layer 1 through Layer N) to produce an output.]
Backpropagation
The learning process for neural networks is called backpropagation. This technique modifies the
weights of the neural network iteratively through calculation of deltas between predicted and
expected outputs. After this calculation, the weights are updated backward through earlier layers via
stochastic gradient descent. This process continues until the weights that minimize the loss function
are found.
For regression tasks, the commonly used loss function to be optimized is squared error, whereas for
classification tasks, the common loss function used is cross-entropy. Given a loss function L, we can
update the weights through the chain rule in the following form, where z is the model's output (and
the best guess of our target variable y) and x denotes the intermediate outputs that connect the
weights w to z:

∂L(z, y)/∂w = (∂L(z, y)/∂z) · (∂z/∂x) · (∂x/∂w)
and the weights are updated via:

w = w − α · ∂L(z, y)/∂w
For ML-heavy roles, we've seen interviewers expect candidates to explain the technical details
behind basic backpropagation on a whiteboard, for basic methods such as linear regression or
logistic regression.
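As a sketch of what such a whiteboard walkthrough might cover (not from the original text), here is batch gradient descent for logistic regression with cross-entropy loss; the data and variable names are made up for illustration.

```python
# Minimal sketch: gradient descent for logistic regression (forward pass, gradient, update).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)     # hypothetical labels

w = np.zeros(2)
b = 0.0
learning_rate = 0.1

for _ in range(500):
    z = 1.0 / (1.0 + np.exp(-(X @ w + b)))    # forward pass: predicted probabilities
    grad_w = X.T @ (z - y) / len(y)           # dL/dw for cross-entropy loss with a sigmoid output
    grad_b = (z - y).mean()                   # dL/db
    w -= learning_rate * grad_w               # update step: w = w - alpha * dL/dw
    b -= learning_rate * grad_b

print("training accuracy:", ((z > 0.5) == y).mean())
```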
Interviewers also like to ask about the hyperparameters involved in neural networks. For example,
the amount by which the weights are updated during each training step, α, is called the learning rate. If
the learning rate is too small, the optimization process may progress very slowly or stall. Conversely, if the learning rate is
too large, the optimization might overshoot the minimum, oscillate, or settle prematurely at a suboptimal solution. Besides the
learning rate, other hyperparameters in neural networks include the number of hidden layers, the
activation functions used, batch size, and so on. For an interview, it's helpful to know how each
hyperparameter affects a neural network's training time and model performance.
General Framework
One issue that can come up in training neural nets is the problem of vanishing gradients. Vanishing
gradients refers to the fact that sometimes the gradient of the loss function will be tiny, and may
completely stop the neural network from training because the weights aren't updated properly.
Since backpropagation uses the chain rule, multiplying n small numbers to compute gradients for
early layers in a network means that the gradient gets exponentially smaller with more layers. This
can happen particularly with traditional activation functions like hyperbolic tangent, whose gradients
range between zero and one. The opposite problem, where activation functions create large
derivatives, is known as the exploding gradient problem.
One common technique to address extremes in gradient values is to allow gradients from later layers
to directly pass into earlier layers without being multiplied many times — something which residual
neural networks (ResNets) and LSTMs both utilize. Another approach to prevent extremes in the
gradient values is to alter the magnitude of the gradient changes by changing the activation function
used (for example, ReLU). The details behind these methods are beyond this book's scope but are
worth looking into for ML-heavy interviews.
Momentum is an optimization method used to accelerate learning while using SGD. While
using SGD, we can sometimes see small and noisy gradients. To solve this, we can introduce a new
parameter, velocity, which is the direction and speed at which the learning dynamics change. The
velocity changes based on previous gradients (in an exponentially decaying manner) and increases
the step size for learning in any iteration, which helps the gradient maintain a consistent direction
and pace throughout the training process.
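A minimal sketch (not from the original text) of the momentum update on a toy quadratic loss; the learning rate and momentum coefficient are illustrative values.

```python
# Minimal sketch: gradient descent with momentum versus plain gradient descent on L(w) = w^2.
def grad(w):
    return 2 * w          # gradient of L(w) = w^2

w_plain, w_mom = 5.0, 5.0
velocity = 0.0
lr, beta = 0.1, 0.9       # learning rate and momentum coefficient

for _ in range(50):
    w_plain -= lr * grad(w_plain)                    # vanilla update
    velocity = beta * velocity - lr * grad(w_mom)    # velocity accumulates past gradients
    w_mom += velocity                                # momentum update

print("plain:", round(w_plain, 5), "momentum:", round(w_mom, 5))
```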
Transfer Learning
Lastly, for training neural networks, practitioners often use or repurpose pre-trained layers (components
of a model that have already been trained and published). This approach is called transfer learning
and is especially common in cases where models require a large amount of data (for example, BERT
for language models and ImageNet for image classification). Transfer learning is beneficial when you
have insufficient data for a new domain, and there is a large pool of existing data that can be
transferred to the problem of interest. For example, say you wanted to help Jian Yang from the TV
show Silicon Valley build an app to detect whether something was a hot dog or not a hot dog. Rather
than just using your 100 images of hot dogs, you can use a model pre-trained on ImageNet (a dataset of many
millions of images) to get a great model right off the bat, and then layer on any extra specific training
data you might have to further improve accuracy.
Addressing Overfitting
Deep neural networks are prone to overfitting because of the model complexity (there are many
parameters involved). As such, interviewers frequently ask about the variety of techniques which are
used to reduce the likelihood of a neural network overfitting. Adding more training data is the
simplest way to address variance if you have access to significantly more data and computational
power to process that data. Another way is to standardize features (so each feature has 0 mean and
unit variance), since this speeds up the learning algorithm. Without normalized inputs, each feature
takes on a wide range of values, and the corresponding weights for those features can vary
dramatically, resulting in larger updates in backpropagation. These large updates may cause
oscillation in the weights during the learning stage, which causes overfitting and high variance.
Batch normalization is another technique to address overfitting. In this process, activation values are
normalized within a given batch so that the representations at the hidden layers do not vary
drastically, thereby allowing each layer of a network to learn more independently of the others.
This is done for each hidden neuron, and also improves training speed. Here, applying a
standardization process similar to how inputs are standardized is recommended.
Lastly, dropout is a regularization technique that deactivates several neurons randomly at each
training step to avoid overfitting. Dropout enables simulation of different architectures, because
instead of a full original neural network, there will be random nodes dropped at each layer. Both
batch normalization and dropout help with regularization since the effects they have are similar to
adding noise to various parts of the training process.
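As a minimal sketch, assuming PyTorch (not from the original text), of where batch normalization and dropout sit inside a small feed-forward network; the layer sizes are arbitrary.

```python
# Minimal sketch: a small feed-forward network using batch normalization and dropout.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(20, 64),
    nn.BatchNorm1d(64),   # normalize activations within each mini-batch
    nn.ReLU(),
    nn.Dropout(p=0.5),    # randomly zero out half the activations during training
    nn.Linear(64, 2),
)

x = torch.randn(32, 20)   # a batch of 32 examples with 20 features
model.train()             # dropout and batch norm statistics active
print(model(x).shape)     # torch.Size([32, 2])
model.eval()              # dropout disabled; batch norm uses running statistics at inference
```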
CNNs
Convolutional neural networks (CNNs) are heavily used in computer vision because they can capture
the spatial dependencies of an image through a series of filters. Imagine you were looking at a
picture of some traffic lights. Intuitively, you need to figure out the components of the lights (i.e., red, green,
yellow) when processing the image. CNNs can determine elements within that picture by looking at
various neighborhoods of the pixels in the image. Specifically, convolution layers can extract features
such as edges, color, and gradient orientation. Then, pooling layers apply a version of dimensionality
reduction in order to extract the most prominent features that are invariant to rotation and position.
Lastly, the results are mapped into the final output by a fully connected layer.
RNNs
Recurrent neural networks (RNNs) are another common type of neural network. In an RNN, the
nodes form a directed graph along a temporal sequence and use their internal state (called
memory). RNNs are often used in learning sequential data such as audio or video — cases where the
current context depends on past history. For example, say you are looking through the frames of a
video. What will happen in the next frame is likely to be highly related to the current frame, but not
as related to the first frame of the video. Therefore, when dealing with sequential data, having a
notion of memory is crucial for accurate predictions. In contrast to CNNs, RNNs can handle arbitrary
input and output lengths and are not feed-forward neural networks, instead using this internal
memory to process arbitrary sequences of data.
LSTMs
Long short-term memory networks (LSTMs) are a fancier version of RNNs. In LSTMs, a common unit is
composed of a cell, an input gate (how much to write to a cell), an output gate (how much to read from a
cell), and a forget gate (how much to erase from a cell). This architecture allows for regulating the
flow of information into and out of any cell. Compared to vanilla RNNs, which only learn short-term
dependencies, LSTMs have additional properties that allow them to learn long-term dependencies.
Therefore, in most real-world scenarios, LSTMs are used instead of RNNs.
Reinforcement Learning
Reinforcement learning (RL) is an area of machine learning outside of supervised and unsupervised
learning. RL is about teaching an agent to learn which decisions to make in an environment to
maximize some reward function. The agent takes a series of actions throughout a variety of states
and is rewarded accordingly. During the learning process, the agent receives feedback based on the
actions taken and aims to maximize the overall value acquired.
Is ML even needed? Maybe a simple heuristic or a rules-based approach works well enough? Or
perhaps a hybrid approach with humans in the loop would work best?
Is it even legal or ethical to apply ML to this problem? Are there regulatory issues at play
dictating what kinds of data or models you can use? For example, lending institutions cannot
legally use some demographic variables like race.
How do end users benefit from the solution, and how would they use the solution (as a
standalone, or an integration with existing systems)?
Is there a clear value add to the business from a successful solution? Are there any other
stakeholders who would be affected?
If an incorrect prediction is made, how will it impact the business? For example, a spam email
making its way into your inbox isn't as problematic as a high-risk mortgage application
accidentally being approved.
Does ML need to solve the entire problem, end-to-end, or can smaller decoupled systems be
made to solve sub-problems, whose output is then combined? For example, do you need to
make a full self-driving algorithm, or separate smaller algorithms for environment perception,
path planning, and vehicle control?
Once you understand the business problem that your ML solution is trying to solve, you can clarify
some of the technical requirements. Aligning on the technical requirements is especially important
when confronted with an ML systems design problem. Some questions to ask to anchor the
conversation:
What's the latency needed? For example, search autocomplete is useless if it takes predictions
longer to load than it takes users to type out their full query. Does every part of the system need
to be real time—while inference may need to be fast, can training be slow?
Are there any throughput requirements? How many predictions do you need to serve every
minute?
Where is this model being deployed? Does the model need to fit on-device? If so, how big is too
big to deploy? And how costly is deployment? For example, adding a high-end GPU to a car is
feasible cost-wise, but adding one to a drone might not be.
While spending so much time on problem definition may seem tedious, the reality is that defining
the right solution for the right problem can save you many weeks of technical work and painful
iterations later down the road. That's why interviewers, when posing open-ended ML problems,
expect you to ask the right questions — ones that scope down your solution. By clarifying these
constraints and objectives up front, you make better decisions on downstream steps of the end-to-
end workflow. To further your business and product clarification skills, read the sections on product
sense and company research in Chapter 10: Product Sense.
And one last piece of advice: don’t go overboard with the questions! Remember, this is a time-bound
interview, so make sure your questions and assumptions are reasonable and relevant (and concise).
You don’t want to be like a toddler and ask 57 questions without getting anywhere.
solving the business problem. For example, for a customer support request classification model, a
90% model accuracy means that 50% of the customer tickets that previously needed to be rerouted
now end up in the right place, resulting in a 10% decrease in time to resolution.
In real-world scenarios, it's best to opt for a single metric rather than picking multiple metrics to
capture different sub-goals. That's because a single metric makes it easier to rank model
performance. Plus, it's easier to align the team around optimizing a single number. However, in
interview contexts, it may be beneficial to mention multiple metrics to show you've thought about
the various goals and trade-offs your ML solution needs to satisfy. As such, in an interview, we
recommend you start your answer with a single metric, but then hedge your answer by mentioning
other potential metrics to track.
For example, if posed a question about evaluation metrics for a spam classifier, you could start off by
talking about accuracy, and then move on to precision and recall as the conversation becomes more
nuanced. In an effort to optimize a single metric, you could recommend using the F-1 score. A
nuanced answer could also incorporate an element of satisficing — where a secondary metric is just
good enough. For example, you could optimize precision @ recall 0.95 — i.e., constraining the recall
to be at least 0.95 while optimizing for precision. Or you could suggest blending multiple metrics into
one by weighting different sub-metrics, such as false positives versus false negatives, to create a final
metric to track. This is often known as an OEC (overall evaluation criterion), and gives you a balance
between different metrics.
Once you've picked a metric, you need to establish what success looks like. While for a classifier, you
might desire 100% accuracy, is this a realistic bar for measuring success? Is there a threshold that's
good enough? This is why inquiring about baseline performance in Step 1 becomes crucial. If
possible, you should use the performance of the existing setup for comparison (for example, if the
average time to resolution for customer support tickets is 2 hours, you could aim for 1 hour — not a
97% ticket classification accuracy). Note: in real-world scenarios, the bar for model performance needed
to still have a positive business impact isn't as high as you'd think.
Be sure to voice all these metric considerations to your interviewer so that you can show you've
thought critically about the problem. For more guidance, read Chapter 10: Product Sense, which
covers the nuances and pitfalls of metric selection.
To boost model performance, it might not be about collecting more data generally. Instead, you can
intentionally source more examples of edge cases via data augmentation or artificial data synthesis.
For example, suppose your traffic light detector struggles in low-contrast situations. You could make
a version of your training images that has less contrast in order to give your neural network more
practice on these trickier photos. Taken to the extreme, you can even simulate the entire
environment, as is common in the self-driving car industry. Simulation is used in the autonomous
vehicle space because encountering the volume of rare and risky situations needed to adequately
train a model based on only real-world driving is infeasible.
Finally, do you understand the data? Questions to consider:
How fresh is the data? How often will the data be updated?
Is there a data dictionary available? Have you talked to subject matter experts about it?
How was the data collected? Was there any sampling, selection, or response bias?
Imputing the missing values via basic methods such as column mean/median.
Stemming: reduces a word down to a root word by deleting characters (for example, turning the
words "liked" and ' 'likes" into "like").
Lemmatization: somewhat similar to stemming, but instead of just reducing words into roots, it
takes into account the context and meaning of the word (for example, it would turn "caring" to
"care," whereas stemming would turn "caring" to "car").
Filtering: removes "stop words" that don't add value to a sentence like "the" and "a", along with
removing punctuation.
Bag-of-words: represents text as a collection of words by associating each word and its
frequency.
N-grams: an extension of bag-of-words where we use N words in a sequence.
Word embeddings: a representation that converts words to vectors that encode the meaning of
the word, where words that are closer in meaning are closer in vector space (popular methods
include word2vec and GloVe).
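A minimal sketch, assuming scikit-learn (not from the original text), of bag-of-words and n-gram features with stop-word filtering; the toy sentences are made up.

```python
# Minimal sketch: bag-of-words and n-gram features with stop-word filtering (toy sentences).
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I liked the song", "She likes the new song", "The song is new"]

# Unigrams + bigrams, dropping English stop words like "the"
vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words="english", lowercase=True)
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())   # learned vocabulary (words and bigrams)
print(X.toarray())                          # term counts per document
```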
Step 9: Deployment
So, now that you've picked out a model, how do you deploy it? The process of operationalizing the
entire deployment process is referred to as "MLOps" (where DevOps meets ML). Two popular tools in
this space are Airflow and MLFlow. Because frameworks come and go, and many large tech
companies use their own internal versions of these tools, it's rare to be asked about these specific
technologies in interviews. However, knowledge of high-level deployment concepts is still helpful.
Generally, systems are deployed online, in batch, or as a hybrid of the two approaches. Online means
latency is critical, and thus model predictions are made in real time. Since model predictions
generally need to be served in real time, there will typically be a caching layer of precomputed features.
Downsides for online deployment are that it can be computationally intensive to meet latency
requirements and requires robust infrastructure monitoring and redundancy.
Batch means predictions are generated periodically and is helpful for cases where you don't need
immediate results (most recommendation systems, for example) or require high throughput. But the
downside is that batch predictions may not be available for new data (for example, a
recommendation list cannot be updated until the next batch is computed). Ideally, you can work
with stakeholders to find the sweet spot where a batch predictor is updated frequently enough to
be "good enough" to solve the problem at hand.
One deployment issue worth bringing up, that's common to both batch and online ML systems, is
model degradation. Models degrade because the underlying distributions of data for your model
change. For a concrete example, suppose you were working for a clothing e-commerce site and were
training a product recommendation model in the winter time. Come summer, you might accidentally
be recommending Canada Goose jackets in July, which is not a very relevant product suggestion to anyone
besides Drake.
This feature drift leads to the phenomenon known as training-serving skew, where there's a
performance hit between the model in training and evaluation time versus when the model is served
in production. To show awareness for the training-serving skew issue, be sure to mention to
your
7.27. Two Sigma: Describe the kernel trick in SVMs and give a simple example. How do you decide
what kernel to choose?
7.28. Morgan Stanley: Say we have N observations for some variable which we model as being drawn
from a Gaussian distribution. What are your best guesses for the parameters of the
distribution?
7.29. Stripe: Say we are using a Gaussian mixture model (GMM) for anomaly detection of fraudulent
transactions to classify incoming transactions into K classes. Describe the model setup
formulaically and how to evaluate the posterior probabilities and log likelihood. How can we
determine if a new transaction should be deemed fraudulent?
7.30. Robinhood: Walk me through how you'd build a model to predict whether a particular
Robinhood user will churn?
7.31. Two Sigma: Suppose you are running a linear regression and model the error terms as being
normally distributed. Show that in this setup, maximizing the likelihood of the data is equivalent
to minimizing the sum of the squared residuals.
7.32. Uber: Describe the idea behind Principal Component Analysis (PCA) and describe its
formulation and derivation in matrix form. Next, go through the procedural description and
solve the constrained maximization.
7.33. Citadel: Describe the model formulation behind logistic regression. How do you maximize the
log-likelihood of a given model (using the two-class case)?
7.34. Spotify: How would you approach creating a music recommendation algorithm for Discover
Weekly (a 30-song weekly playlist personalized to an individual user)?
7.35. Google: Derive the variance-covariance matrix of the least squares parameter estimates in
matrix form.
Another way is to resample classes by running ensemble models with different ratios of the classes,
or by running an ensemble model using all samples of the rare class and a differing amount of the
abundant class. Note that some models, such as logistic regression, are able to handle unbalanced
classes relatively well in a standalone manner. You can also adjust the probability threshold to
something besides 0.5 for classifying the unbalanced outcome.
Lastly, you can design your own cost function that penalizes wrong classification of the rare class
more than wrong classifications of the abundant class. This is useful if you have to use a particular
kind of model and you're unable to resample. However, it can be complex to set up the penalty
matrix, especially with many classes.
Solution #7.2
We can denote squared error as MSE and absolute error as MAE. Both are measures of distances
between vectors and express average model prediction error in units of the target variable. Both can range
from 0 to infinity; the lower the score, the better the model.
The main difference is that errors are squared before being averaged in MSE, meaning there is a
relatively high weight given to large errors. Therefore, MSE is useful when large errors in the model
are trying to be avoided. This means that outliers disproportionately affect MSE more than MAE —
meaning that MAE is more robust to outliers. Computation-wise, MSE is easier to use, since the
gradient calculation is more straightforward than that of MAE, which requires linear programming to
compute the gradient.
Therefore, if the model needs to be computationally easier to train or doesn't need to be robust to
outliers, then MSE should be used. Otherwise, MAE is the better option. Lastly, MSE corresponds to
maximizing the likelihood of Gaussian random variables, and MAE does not. MSE is minimized by the
conditional mean, whereas MAE is minimized by the conditional median.
Solution #7.3
The elbow method is the most well-known method for choosing k in k-means clustering. The
intuition behind this technique is that the first few clusters will explain a lot of the variation in the
data, but past a certain point, the amount of information added is diminishing. Looking at a graph of
explained variation (on the y-axis) versus the number of clusters (k), there should be a sharp change
in the y-axis at some level of k. For example, in the graph that follows, we see a drop off at
approximately k=6.
Note that the explained variation is quantified by the within-cluster sum of squared errors. To
calculate this error metric, we look at, for each cluster, the total sum of squared errors (using
Euclidean distance). A caveat to keep in mind: the assumption of a drop in variation may not
necessarily be true — the y-axis may be continuously decreasing slowly (i.e., there is no significant
drop).
Another popular alternative to determining k in k-means clustering is to apply the silhouette
method, which aims to measure how similar a point is to its own cluster compared to other clusters.
Concretely, it looks at:

s = (x − y) / max(x, y)

where x is the mean distance to the examples of the nearest cluster, and y is the mean distance to
other examples in the same cluster. The coefficient varies between -1 and 1 for any given point. A
value of 1 implies that the point is in the "right" cluster and vice versa for a score of -1. By plotting
the
score on the y-axis versus k, we can get an idea for the optimal number of clusters based on this
metric. Note that the metric used in the silhouette method is more computationally intensive to
calculate for all points versus the elbow method.
[Figure: elbow method plot of explained variation versus the number of clusters k.]
Taking a step back, while both the elbow and silhouette methods serve their purpose, sometimes it
helps to lean on your business intuition when choosing the number of clusters. For example, if you
are clustering patients or customer groups, stakeholders and subject matter experts should have a
hunch concerning how many groups they expect to see in the data. Additionally, you can visualize
the features for the different groups and assess whether they are indeed behaving similarly. There is
no perfect method for picking k, because if there were, it would be a supervised problem and not an
unsupervised one.
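A minimal sketch, assuming scikit-learn (not from the original text), that computes the elbow-method quantity (within-cluster sum of squared errors) and the silhouette score for several candidate values of k on synthetic data.

```python
# Minimal sketch: comparing candidate values of k via inertia (elbow) and silhouette scores.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k,
          "within-cluster SSE:", round(km.inertia_, 1),            # elbow method quantity
          "silhouette:", round(silhouette_score(X, km.labels_), 3))
```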
Solution #7.4
Investigating outliers is often the first step in understanding how to treat them. Once you understand
the nature of why the outliers occurred, there are several possible methods we can use:
Add regularization: reduces variance, for example L1 or L2 regularization.
Try different models: can use a model that is more robust to outliers. For example, tree-based
models (random forests, gradient boosting) are generally less affected by outliers than linear
models.
Winsorize data: cap the data at various arbitrary thresholds. For example, at a 90% winsorization,
we can take the top and bottom 5% of values and set them to the 95th and 5th percentile of
values, respectively.
Transform data: for example, do a log transformation when the response variable follows an
exponential distribution or is right skewed.
Change the error metric to be more robust: for example, for MSE, change it to MAE or Huber
loss.
Remove outliers: only do this if you're certain that the outliers are true anomalies not worth
incorporating into the model. This should be the last consideration, since dropping data means
losing information on the variability in the data.
Solution #7.5
There will be two primary problems when running a regression if several of the predictor variables
are correlated. The first is that the coefficient estimates and signs will vary dramatically, depending
on what particular variables you included in the model. Certain coefficients may even have
confidence intervals that include 0 (meaning it is difficult to tell whether an increase in that X value is
associated with an increase or decrease in Y or not), and hence results will not be statistically
significant. The second is that the resulting p-values will be misleading. For instance, an important
variable might have a high p-value and so be deemed as statistically insignificant even though it is
actually important. It is as if the effect of the correlated features were "split" between them, leading
to uncertainty about which features are actually relevant to the model.
You can deal with this problem by either removing or combining the correlated predictors. To
effectively remove one of the predictors, it is best to understand the causes of the correlation (i.e.,
did you include extraneous predictors such as X and 2X, or are there some latent variables underlying
one or more of the ones you have included that affect both?). To combine predictors, it is possible to
include interaction terms (the product of the two that are correlated). Additionally, you could also (1)
center the data and (2) try to obtain a larger sample size, thereby giving you narrower confidence
intervals. Lastly, you can apply regularization methods (such as in ridge regression).
Solution #7.6
Random forests are used since individual decision trees are usually prone to overfitting. Not only can
these utilize multiple decision trees and then average their decisions, but they can be used for either
classification or regression. There are a few main ways in which they allow for stronger out-of-
sample prediction than do individual decision trees.
As in other ensemble models, using a large set of trees created in a resample of the data
(bootstrap aggregation) will lead to a model yielding more consistent results. More specifically,
and in contrast to decision trees, it leads to diversity in training data for each tree and so
contributes to better results in terms of bias-variance trade-off (particularly with respect to
variance).
Using only m < p features at each split helps to de-correlate the decision trees, thereby avoiding
having very important features always appearing at the first splits of the trees (which would
happen on standalone trees due to the nature of information gain).
They're fairly easy to implement and fast to run.
They can produce very interpretable feature-importance values, thereby improving model
understandability and feature selection.
The first two bullet points are the main ways random forests improve upon single decision trees.
Solution #7.7
Step 1: Clarify the Missing Data
Since these types of problems are generally context dependent, it's best to start your answer with
clarifying questions. For example,
It would also be useful to think about why the data is missing, because this affects how you'd impute
the data. Missing data is commonly classified as:
Missing completely at random (MCAR): the probability of being missing is the same for all
cases
Missing at random (MAR): the probability of being missing is the same within groups defined by
the observed data
Not missing at random (NMAR): if the data is not MCAR and not MAR
Solution #7.8
There are several possible ways to improve the performance of a logistic regression:
Normalizing features: The features should be normalized such that particular weights do not
dominate within the model.
Adding additional features: Depending on the problem, it may simply be the case that there
aren't enough useful features. In general, logistic regression is high bias, so adding more features
should be helpful.
Addressing outliers: Identify and decide whether to retain or remove them.
Selecting variables: See if any features have introduced too much noise into the process.
Cross validation and hyperparameter tuning: Using k-fold cross validation along with
hyperparameter tuning (for example, introducing a penalty term for regularization purposes)
should help improve the model.
The classes may not be linearly separable (logistic regression produces linear decision
boundaries), and, therefore, it would be worth looking into SVMs, tree-based approaches, or
neural networks instead.
Solution #7.9
For regular regression, recall we have the following for our least squares estimator:

β̂ = (X^T X)^{-1} X^T y

If the dataset is duplicated so that every observation appears twice, the design matrix and response become the stacked versions [X; X] and [y; y], and the estimator becomes:

β̂ = ([X; X]^T [X; X])^{-1} [X; X]^T [y; y]

Simplifying yields:

β̂ = (2 X^T X)^{-1} (2 X^T y) = (X^T X)^{-1} X^T y

which is the same as the original estimator, so the coefficient estimates are unchanged.
Solution #7.10
In both gradient boosting and random forests, an ensemble of decision trees are used. Additionally,
both are flexible models and don't need much data preprocessing.
However, there are two main differences. The first main difference is that, in gradient boosting, trees
are built one at a time, such that successive weak learners learn from the mistakes of preceding
weak learners. In random forests, the trees are built independently at the same time.
The second difference is in the output: gradient boosting combines the results of the weak learners
with each successive iteration, whereas, in random forests, the trees are combined at the end
(through either averaging or majority).
Because of these structural differences, gradient boosting is often more prone to overfitting than are
random forests due to their focus on mistakes over training iterations and the lack of independence
in tree building. Additionally, gradient boosting hyperparameters are harder to tune than those of
random forests. Lastly, gradient boosting may take longer to train than random forests because its
trees are built sequentially, whereas random forest trees can be built in parallel. In real-life applications, gradient boosting generally excels
when used on unbalanced datasets (fraud detection, for example), whereas random forests
generally excel at multi-class object detection with noisy data (computer vision, for example).
Solution #7.11
Because "accurate enough" is subjective, it's best to ask the interviewer clarifying questions before
addressing the lack of training data, To stand out, you can also proactively mention ways to source
more training data at the end of your answer.
If after learning curves you realize that you don't have sufficient data to build an accurate enough
model, the interviewer would likely shift the discussion to dealing with this lack of data. Or, if you are
feeling like an overachiever with a can-do attitude, you could proactively bring up these discussion
points:
Are there too few features? If so, you want to look into adding features like marketplace supply
and demand indicators, traffic patterns on the road at the time of the delivery, etc.
Are there too many features? If there are almost as many or more features than data points,
then our model will be prone to overfitting and we should look into either dimensionality
reduction or feature selection techniques.
Can different models be used that handle smaller training datasets better?
Is it possible to acquire more data in a cost-effective way?
Is the less accurate ETA model a true launch blocker? If we launched in the new market, which
generates more training data, can the ETA model be retrained?
Solution #7.12
Without looking at features, we could look at partial dependence plots (also called response curves)
to assess how any one feature affects the model's decision. A partial dependence plot shows the
marginal effect of a feature on the predicted target of a machine learning model. So, after the model is fit,
we can take all the features and start plotting them individually against the loan approval/ rejection,
while keeping all the other features constant.
For example, if we believe that FICO score has a strong relationship to the predicted probability of
loan rejection, then we can plot the loan approvals and rejections as we adjust the FICO score from
low to higher. Thus, we can get an idea of how features impact the model without explicitly looking
at feature weights, and supply reasons for rejection accordingly.
As a concrete example, consider having four applicants: 1, 2, 3, and 4. Assume that the features
include annual income, current debt, number of credit cards, and FICO score. Suppose we have the
following situation:
1. $100,000 income, $10,000 debt, 2 credit cards, and FICO score of 700.
2. $100,000 income, $10,000 debt, 2 credit cards, and FICO score of 720.
3. $100,000 income, $10,000 debt, 2 credit cards, and FICO score of 600.
4. $100,000 income, $10,000 debt, 2 credit cards, and FICO score of 650.
If 3 and 4 were rejected but 1 and 2 were accepted, then we can intuitively reason that a lower FICO
score was the reason the model made the rejections. This is because the remaining features are
equal, so the model chose to reject 3 and 4 "all-else-equal" versus 1 and 2.
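A minimal sketch (not from the original text) of computing a partial dependence curve by hand: sweep one feature over a grid while holding the others at their observed values, then average the model's predictions. The dataset and the choice of feature index are made up for illustration.

```python
# Minimal sketch: a hand-rolled partial dependence curve for one feature.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, n_features=4, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

j = 3                                            # pretend this column is a FICO-score-like feature
grid = np.linspace(X[:, j].min(), X[:, j].max(), 10)
partial_dependence = []
for value in grid:
    X_mod = X.copy()
    X_mod[:, j] = value                          # hold all other features fixed, vary feature j
    partial_dependence.append(model.predict_proba(X_mod)[:, 1].mean())

for v, p in zip(grid, partial_dependence):
    print(round(v, 2), round(p, 3))
```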
Solution #7.13
To find synonyms, we can first find word embeddings through a corpus of words. Word2vec is a
popular algorithm for doing so. It produces vectors for words based on the words' contexts. Vectors
that are closer in Euclidean distance are meant to represent words that are also closer in context and
meaning. The word embeddings that are thus generated are weights on the resulting vectors. The
distance between these vectors can be used to measure similarity, for example, via cosine similarity
or some other similar measure.
Once we have these word embeddings, we can then run an algorithm such as K-means clustering to
identify clusters within word embeddings or run a K-nearest neighbor algorithm to find a particular
word for which we want to find synonyms. However, some edge cases exist, since word2vec can
produce similar vectors even in the case of antonyms; consider the words "hot" and "cold," for
example, which have opposite meanings but appear in many similar contexts (related to
temperature or in a Katy Perry song).
Solution #7.14
The bias-variance trade-off is expressed as the following: Total model error = Bias² + Variance +
Irreducible error. Flexible models tend to have low bias and high variance, whereas more rigid
models have high bias and low variance. The bias term comes from the error that occurs when a
model underfits data. Having a high bias means that the machine learning model is too simple and
may not adequately capture the relationship between the features and the target. An example
would be using linear regression when the underlying relationship is nonlinear.
The variance term represents the error that occurs when a model overfits data. Having a high
variance means that a model is susceptible to changes in training data, meaning that it is capturing
and so reacting to too much noise. An example would be using a very complex neural network when
the true underlying relationship between the features and the target is simply a linear one.
The irreducible term is the error that cannot be addressed directly by the model, such as from noise
in data measurement.
When creating a machine learning model, we want both bias and variance to be low, because we
want to be able to have a model that predicts well but that also doesn't change much when it is fed
new data. The key point here is to prevent overfitting and, at the same time, to attempt to retain
sufficient accuracy.
Solution #7.15
Cross validation is a technique used to assess the performance of an algorithm in several resamples/
subsamples of training data. It consists of running the algorithm on subsamples of the training data,
such as the original data less some of the observations comprising the training data, and evaluating
model performance on the portion of the data that was not present in the subsample. This process is
repeated many times for the subsamples, and the results are combined at the end. This step is very
important in ML because it reveals the quality and consistency of the model's true performance.
The general process of k-fold cross validation is as follows:
1. Split the training data into k equally sized blocks (folds).
2. For each block i, train the model on the other k − 1 blocks and compute the
validation error using block i.
3. Average the k validation errors from step 2 to get an estimate of the true error.
This process aids in accomplishing the following: (1) avoiding training and testing on the same
subsets of data points, which would lead to overfitting, and (2) avoiding using a dedicated validation
set, with which no training can be done. The second of these points is particularly important in cases
where very little training data is available or the data collection process is expensive. One drawback
of this process, however, is that roughly k times more computation is needed than using a
dedicated holdout validation set. In practice, cross validation works very well for smaller datasets.
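A minimal sketch, assuming scikit-learn (not from the original text), of 5-fold cross validation and averaging the per-fold scores; the data is synthetic.

```python
# Minimal sketch: k-fold cross validation and averaging the k validation scores.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, random_state=0)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=5)   # 5 train/validate splits
print("per-fold accuracy:", scores)
print("estimated true accuracy:", scores.mean())
```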
Solution #7.16
Step 1: Clarify Lead Scoring Requirements
Lead scoring is the process of assigning numerical scores for any leads (potential customers) in a
business. Lead scores can be based on a variety of attributes, and help sales and marketing teams
prioritize leads to try and convert them to customers.
As always, it's smart to ask the interviewer clarifying questions. In this case, learning more about the
requirements for the lead scoring algorithm is critical. Questions to ask include:
Are we building this for our own company's sales leads? Or, are we building an extensible version
as part of the Salesforce product?
Are there any business requirements behind the lead scoring (i.e., does it need to be easy to
explain internally and/or externally)?
Are we running this algorithm only on companies in our sales database (CRM), or looking at a
larger landscape of all companies?
For our solution, we'll assume the interviewer means we want to develop a lead scoring model to
be used internally — that means using the company's internal sales data to predict whether a
prospective company will purchase a Salesforce product.
After selecting features, it is good to conduct the standard set of feature engineering best practices.
Note that the model will only be as good as the data and judgement in feature engineering applied
— in practice, many companies that predict lead scoring can face issues with missing data or lack of
relevant data.
Solution #7.17
Collaborative filtering would be a commonly used method for creating a music recommendation
algorithm. Such algorithms use data on what feedback users have provided on certain items (songs
in this case) in order to decide recommendations. For example, a well-known use case is for movie
recommendation on Netflix. However, there are several differences compared to the Netflix case:
Feedback for music does not have a 1-to-5 rating scale as Netflix does for its movies.
Music may be subject to repeated consumption; that is, people may watch a movie once or
twice but will listen to a song many times over.
Music has a wider variety (i.e., niche music).
The scale of music catalog items is much larger than movies (i.e., there are many more songs
than movies).
Therefore, a user-song matrix (or a user-artist matrix) would constitute the data for this issue, with
the rows of the dataset being users and the columns various songs. However, in considering the first
point above, since explicit ratings are lacking, we can rely on implicit feedback: count the number of
times a song is streamed and store this count.
We can then proceed with matrix factorization. Say there are M songs and N users in the matrix,
which we will label R. Then, we want to solve R ≈ PQ^T, where P holds the latent user vectors p_u and
Q holds the latent song vectors q_i, so that the predicted relevance of song i to user u is the dot
product r̂_{ui} = q_i^T p_u.
Various methods can be used for this matrix factorization; a common one is alternating least squares
(ALS), and, since the scale of the data is large, this would likely be done through distributed
computing. Once the latent user and song vectors are discovered, then the above dot product will be
able to predict the relevance of a particular song to a user. This process can be used directly for
recommendation at the user level, where we sort by relevance prediction on songs that the user has
not yet streamed. In addition, the vectors given above can be employed in such tasks as assessing
similarity between different users and different songs using a method such as kNN (K-nearest
neighbors).
Solution #7.18
Mathematically, a convex function f satisfies the following for any two points x and y in the domain
of f: f((1 − a)x + ay) ≤ (1 − a)f(x) + af(y), for 0 ≤ a ≤ 1.
That is, the line segment from (x, f(x)) to (y, f(y)) lies above the graph of f between any points x and y.
Convexity matters because it has implications about the nature of minima in f. Stated more
specifically, any local minimum of f is also a global minimum.
Neural networks provide a significant example of non-convex problems in machine learning. At a
high level, this is because neural networks are universal function approximators, meaning that they can (with a
sufficient number of neurons) approximate any function arbitrarily well. Because not all functions
are convex (convex functions cannot approximate non-convex ones), by definition, they must be
non-convex. In particular, the cost function for a neural network has a number of local minima; you
could interchange parameters of different nodes in various layers and still obtain exactly the Same
cost function output (all inputs/outputs the same, but with nodes swapped). Therefore, there is no
particular global minima, so neural networks cannot be convex.
Solution #7.19
Because information gain is based on entropy, we'll discuss entropy first. For a sample split across k
classes with class proportions p_1, ..., p_k, entropy is given by:
Entropy = −∑_{i=1}^{k} p_i log₂(p_i)
This equation yields the amount of entropy present and shows exactly how homogeneous a
sample is (based on the attribute being split). Consider a case where k = 2. Let a and b be two
outputs/labels that we are trying to classify. Given these values, the formula considers the
proportion of values in the sample that are a and the proportion that are b, with the sample being
split on a different attribute.
A completely homogeneous sample will have an entropy of 0. For instance, if splitting on a given
attribute produces a group containing only values of a (and none of b), then the entropy of that group
would be
Entropy = −1·log₂(1) − 0·log₂(0) = 0
whereas a completely mixed (50%-50%) group would have an entropy of 1. A lower entropy means a
more homogeneous sample.
Information gain is based on the decrease in entropy after splitting on an attribute.
IG(X_j, Y) = H(Y) − H(Y | X_j)
This concept is better explained with a simple numerical example. Consider the above case again
with k = 2, and say there are 5 instances of value a and 5 instances of value b, so that H(Y) = 1 (a
50%-50% split). Then, we decide to split on some attribute X. When X = 1, there are 5 a's and 1 b,
whereas when X = 0, there are 4 b's and 0 a's. The X = 0 branch is perfectly pure (entropy 0), while the
X = 1 branch has entropy −(5/6)log₂(5/6) − (1/6)log₂(1/6) ≈ 0.65. The conditional entropy is therefore
H(Y | X) = (6/10)(0.65) + (4/10)(0) ≈ 0.39, giving an information gain of IG(X, Y) = 1 − 0.39 ≈ 0.61.
Splitting on X therefore produces far more homogeneous groups than the original 50%-50% sample.
Solution #7.20
In machine learning, L1 and L2 penalization are both regularization methods that prevent overfitting
by coercing the coefficients of a regression model towards zero. The difference between the two
methods is the form of the penalization applied to the loss function. For a regular regression model,
assume the loss function is given by L. Using L1 regularization, the least absolute shrinkage and
selection operator, or Lasso, adds the absolute value of the coefficients as a penalty term, whereas
ridge regression uses L2 regularization, that is, adding the squared magnitude of the coefficients as
the penalty term.
The base loss for linear regression is the following:
L = ∑_{i=1}^{n} (y_i − f(x_i))² = ∑_{i=1}^{n} (y_i − ∑_{j=1}^{p} x_ij w_j)²
The loss functions for the two regularized models are thus:
Lasso (L1): L + λ ∑_{j=1}^{p} |w_j|
Ridge (L2): L + λ ∑_{j=1}^{p} w_j²
If we run gradient descent on the weights w, L1 regularization shrinks every weight towards 0 at the
same rate, irrespective of its magnitude, whereas with L2 regularization, the rate at which a weight
approaches 0 slows down as the weight itself approaches 0. Because of this, L1 is more likely to "zero
out" particular weights and hence completely remove certain features from the model, leading to
sparser models.
Solution #7.21
The gradient descent algorithm takes small steps in the direction of steepest descent to optimize a
particular objective function. The size of the "steps" the algorithm takes is proportional to the
negative gradient of the function at the current value of the parameter being sought. The stochastic
version of the algorithm, SGD, uses an approximation of the gradient rather than the exact gradient
itself. This estimate is computed using only one randomly
selected sample at each step to evaluate the derivative of the function, making this version of the
algorithm much faster and more attractive for situations involving lots of data. SGD is also useful
when redundancy in the data is present (i.e., observations that are very similar).
Assume function f at some point x and at time t. Then, the gradient descent algorithm will update x
as follows until it reaches convergence:
x_{t+1} = x_t − a_t ∇f(x_t)
That is, we calculate the negative of the gradient of f and scale that by some constant and move in
that direction at the end of each iteration.
Since many loss functions are decomposable into the sum of individual functions, then the gradient
step can be broken down into addition of discrete, separate gradients. However, for very large
datasets, this process can be computationally intensive, and the algorithm might become stuck at
local minima or at saddle points.
Therefore, we use the stochastic gradient descent algorithm to obtain an unbiased estimate of the
true gradient without going through all data points by uniformly selecting a point at random and
performing a gradient update then and there.
The estimate is therefore unbiased, which we can see as follows. Since the loss decomposes as a sum
over data points, the full gradient is the average of the per-point gradients:
∇f(x) = (1/n) ∑_{i=1}^{n} ∇f_i(x)
Since the data are assumed to be i.i.d. and the point used at each step is sampled uniformly at
random, the stochastic gradient g(x) = ∇f_i(x) satisfies E[g(x)] = ∇f(x), where g(x) is the stochastic
gradient estimate.
Solution #7.22
Recall that the ROC curve plots the true positive rate versus the false positive rate at every possible
threshold. If all scores are transformed in the same way, then none of the actual classifications need
change (the thresholds can be adjusted correspondingly), leading to the same true positive and false
positive rates, since only the relative ordering of the scores matters. Therefore, taking a square root
would not cause any change to the ROC curve because the relative ordering has been maintained: if
one application had a score of X and another a score of Y, and if Y > X, then √Y > √X still. Only the
model thresholds would change.
In contrast, any function that is not monotonically increasing would change the ROC curve, since the
relative ordering would not be maintained. Some simple examples are f(s) = −s (which reverses the
ordering) and f(s) = (s − 0.5)², which is not monotonic on scores in [0, 1].
Solution #7.23
We have X ~ N(μ, σ²), and entropy for a continuous random variable is given by the following:
H(x) = −∫_{−∞}^{∞} p(x) log p(x) dx
For a Gaussian, we have the following density:
p(x) = (1/(σ√(2π))) exp(−(x − μ)²/(2σ²))
so that log p(x) = −log(σ√(2π)) − (x − μ)²/(2σ²). Substituting into the above equation yields
H(x) = ∫_{−∞}^{∞} p(x) log(σ√(2π)) dx + ∫_{−∞}^{∞} p(x) (x − μ)²/(2σ²) dx
where the first term equals
log(σ√(2π)) ∫_{−∞}^{∞} p(x) dx = log(σ√(2π))
since the integral evaluates to 1 (by the definition of a probability density function). The second term
is given by:
(1/(2σ²)) ∫_{−∞}^{∞} p(x)(x − μ)² dx = σ²/(2σ²) = 1/2
since the inner integral is the expression for the variance. The entropy is therefore as follows:
H(x) = 1/2 + log(σ√(2π))
Solution #7.24
The standard approach is (1) to construct a large dataset with the variable of interest (purchase or
not) and relevant covariates (age, gender, income, etc.) for a sample of platform users and (2) to
build a model to calculate the probability of purchase of each item. Propensity models are a form of
binary classifier, so any model that can accomplish this could be used to estimate a customer's
propensity to buy the product.
In selecting a model, logistic regression offers a straightforward solution with an easily interpretable
result: the model outputs a probability score for, in this case, purchasing a particular item, and its
coefficients can be read as log-odds. However, it cannot capture complex interaction effects between
different variables and could also be numerically unstable under certain conditions (e.g., correlated
covariates and a relatively small user base).
An alternative to logistic regression would be to use a more complex model, such as a neural
network or an SVM. Both are good at dealing with high-dimensional data and at capturing the
complex interactions that logistic regression cannot. However, unlike logistic regression, neither is
easy to explain, and both generally require a large amount of data to perform well.
A good compromise is tree-based models, such as random forests, which are typically highly
accurate and are easily understandable. With tree-based models, the features which have the
highest influence on predictions are readily perceived, a characteristic that could be very useful in
this particular case.
Solution #7.25
Both Gaussian naive Bayes (GNB) and logistic regression can be used for classification. The two
models each have advantages and disadvantages, which provide the answer as to which to choose
under what circumstances. These are discussed below, along with their similarities and differences:
Advantages:
1. GNB requires only a small number of observations to be adequately trained; it is also easy to use
and reasonably fast to implement; interpretation of the results produced by GNB can also be
highly useful.
2. Logistic regression has a simple interpretation in terms of class probabilities, and it allows
inferences to be made about features (i.e., variables) and identification of the most relevant of
these with respect to prediction.
Disadvantages:
1. GNB assumes conditional independence of the features given the class, an assumption that breaks
   down when features are correlated.
2. Logistic regression requires an optimization setup (where weights cannot be learned directly
   through counts), whereas GNB requires no such setup.
Similarities:
1. Both methods are linear decision functions generated from training data.
2. GNB's implied P(Y|X) takes the same form as that of logistic regression (but with particular parameters).
Given these advantages and disadvantages, logistic regression would be preferable assuming training
data size is not an issue, since GNB's assumption of conditional independence breaks down when
features are correlated. However, in cases where training data are limited or the data-generating
process includes strong priors, using GNB may be preferable.
Solution #7.26
Assume we have k clusters with centroids μ_1, ..., μ_k and n sample points x_1, ..., x_n.
The loss function then consists of minimizing the total error using a squared L2 norm (since it is a good
way to measure distance) for all points within a given cluster:
L = ∑_{j=1}^{k} ∑_{x_i ∈ S_j} ‖x_i − μ_j‖²
For batch gradient descent, the update step for centroid μ_k is then given by the following:
μ_k := μ_k + ε ∑_{x_i ∈ S_k} 2(x_i − μ_k)
However, for stochastic gradient descent, where a single point x_t is sampled and assigned to its
nearest centroid μ_k, the update step is given by the following:
μ_k := μ_k + ε (x_t − μ_k)
Solution #7.27
The idea behind the kernel trick is that data that cannot be separated by a hyperplane in its current
dimensionality can actually be linearly separable by projecting it onto a higher dimensional space.
Specifically, for a feature mapping φ, the kernel is defined as
k(x, y) = φ(x)ᵀ φ(y)
and we can take any data and map it to a higher dimension through a variety of mapping functions φ.
However, if φ is difficult to compute explicitly, then we have a problem — instead, it is desirable if we
can compute the value of k(x, y) without blowing up the computation.
For instance, say we have two examples and want to map them to a quadratic space. We have the
following:
φ(x_1, x_2) = (1, x_1, x_2, x_1², x_2², x_1 x_2)ᵀ
Solution #7.28
Assume we have some dataset X consisting of n i.i.d. observations: x_1, ..., x_n.
Our likelihood function is then
p(X | μ, σ²) = ∏_{i=1}^{n} N(x_i | μ, σ²),  where N(x_i | μ, σ²) = (1/(σ√(2π))) exp(−(x_i − μ)²/(2σ²))
and therefore the log-likelihood is given by:
log p(X | μ, σ²) = ∑_{i=1}^{n} log N(x_i | μ, σ²) = −(1/(2σ²)) ∑_{i=1}^{n} (x_i − μ)² − (n/2) log σ² − (n/2) log(2π)
Taking the derivative of the log-likelihood with respect to μ and setting the result to 0 yields the
following:
d log p(X | μ, σ²)/dμ = (1/σ²) ∑_{i=1}^{n} (x_i − μ) = 0
Simplifying the result yields ∑_{i=1}^{n} x_i = ∑_{i=1}^{n} μ̂ = n μ̂, and therefore the maximum
likelihood estimate for μ is given by:
μ̂ = (1/n) ∑_{i=1}^{n} x_i
To obtain the variance, we take the derivative of the log-likelihood with respect to σ² and set the
result equal to 0:
d log p(X | μ, σ²)/dσ² = (1/(2σ⁴)) ∑_{i=1}^{n} (x_i − μ)² − n/(2σ²) = 0
Simplifying yields the following:
∑_{i=1}^{n} (x_i − μ)² = n σ²
The maximum likelihood estimate for the variance is thus given by the following:
σ̂² = (1/n) ∑_{i=1}^{n} (x_i − μ̂)²
Solution #7.29
The GMM model assumes that the data follow a mixture of Gaussian probability distributions across
K classes:
p(x) = ∑_{k=1}^{K} π_k N(x | μ_k, Σ_k)
where the coefficients π_k are the mixing coefficients on the clusters and are normalized so that they
sum to 1.
The posterior probabilities for each cluster are given by Bayes' rule and can be interpreted as "what
is the probability of being in class k given the data x":
z_k = p(k | x) = p(k) p(x | k) / p(x) = π_k N(x | μ_k, Σ_k) / ∑_{j=1}^{K} π_j N(x | μ_j, Σ_j)
The unknown set of parameters consists of the mean and covariance parameters for each of the K
classes, along with the K mixing coefficients. The likelihood is therefore given by:
p(X | θ) = ∏_{i=1}^{n} p(x_i) = ∏_{i=1}^{n} ∑_{k=1}^{K} π_k N(x_i | μ_k, Σ_k)
and therefore the log-likelihood is
log p(X | θ) = ∑_{i=1}^{n} log ∑_{k=1}^{K} π_k N(x_i | μ_k, Σ_k)
The parameters can be calculated iteratively using expectation-maximization and the information
above. After the model has been trained, we can then calculate the posterior probabilities of any new
transaction over the K classes as above. If the posterior probabilities calculated are all low, then the
transaction most likely does not belong to any of the K classes, so we can deem it to be fraudulent.
Solution #7.30
Step 1: Clarify What Churn Is & Why It's Important
First, it is important to clarify with your interviewer what churn means. Generally, "churn" describes
a platform's loss of users over time.
To determine what qualifies as a churned user at Robinhood, it's helpful to first follow the money
and understand how Robinhood monetizes. One primary way is by trading activity — whether it is
through their Robinhood Gold offering or order flow sold to market makers like Citadel. Thus, a
cancellation of their Robinhood Gold membership or a long period of no trading activity could
constitute churn. The other way Robinhood monetizes is through a user's account balance: by
collecting interest on uninvested cash and making stock loans to counterparties, Robinhood earns
money on the balances users hold, so a user emptying or closing their account is another form of
churn. Either way, churn represents lost revenue and is therefore to be avoided if possible. So, if
Robinhood is to stay ahead of WeBull, Coinbase, and TD Ameritrade, predicting who will churn, and
then helping these at-risk users, is beneficial.
After you've worked with your interviewer to clarify what churn is in this context, and why it's
important to mitigate, be sure to ask the obvious question: how is my model output going to be
used? If it’s not clear how the model will be used for the business, then even if the model has great
predictive power, it is not useful in practice.
Solution #7.31
In matrix form, we assume Y is distributed as multivariate Gaussian: Y ~ N(Xβ, σ²I)
Solution #7.32
PCA aims to reconstruct data into a lower dimensional setting, and so it creates a small number of
linear combinations of a vector x (assume it to be p dimensional) to explain the variance within x.
More specifically, we want to find vectors w_i of weights such that we can define the following
linear combinations:
y_i = w_iᵀ x = ∑_{j=1}^{p} w_ij x_j
subject to the constraint that the w_i are orthonormal, that each y_i is uncorrelated with the previous
y_j, and that var(y_i) is maximized.
Hence, we perform the following procedure, in which we first find y_1 = w_1ᵀ x with maximal variance,
meaning that the scores are obtained by orthogonally projecting the data onto the first principal
direction, w_1. We then find y_2 = w_2ᵀ x such that y_2 is uncorrelated with y_1 and has maximal
variance, and we continue this procedure iteratively until ending with the kth dimension such that
y_1, ..., y_k explain the majority of the variance, with k << p.
To solve, note that we have the following for the variance of each y_i, utilizing the covariance matrix
Σ of x:
var(y_i) = w_iᵀ var(x) w_i = w_iᵀ Σ w_i
Without any constraints, we could choose arbitrarily large weights to maximize this variance, and
hence we normalize by assuming orthonormality of the w_i, which guarantees the following: w_iᵀ w_i = 1
We now have a constrained maximization problem where we can use Lagrange multipliers.
Specifically, we have the Lagrangian w_iᵀ Σ w_i − λ_i (w_iᵀ w_i − 1), which we differentiate with respect
to w_i to solve the optimization problem:
d/dw_i [ w_iᵀ Σ w_i − λ_i (w_iᵀ w_i − 1) ] = 2Σw_i − 2λ_i w_i = 0,  i.e.,  Σw_i = λ_i w_i
This is the result of an eigen-decomposition, whereby w_i is an eigenvector of the covariance matrix Σ
and λ_i is this vector's associated eigenvalue. Noting that we want to maximize the variance of each
y_i, we pick
w_iᵀ Σ w_i = w_iᵀ λ_i w_i = λ_i w_iᵀ w_i = λ_i
to be as large as possible. Hence, we choose the eigenvector with the largest eigenvalue as the first
principal component, the eigenvector with the second largest eigenvalue as the second principal
component, and so on.
Solution #7.33
Logistic regression aims to classify X into one of k classes by calculating the following:
log [ P(C = i | X = x) / P(C = K | X = x) ] = β_i0 + β_iᵀ x
Therefore, the model is equivalent to the following, where the denominator normalizes the
numerator over the K classes:
P(C = k | X = x) = exp(β_k0 + β_kᵀ x) / ∑_{l=1}^{K} exp(β_l0 + β_lᵀ x)
We note that, for the terms of the log-likelihood corresponding to the baseline class (written here for
the two-class case),
∂/∂β_0 ∑_{i=1}^{n} −log(1 + e^{β_0 + β_1 x_i}) = ∑_{i=1}^{n} −e^{β_0 + β_1 x_i} / (1 + e^{β_0 + β_1 x_i})
The solutions to these equations are not closed form, however, and, hence, the above should be
iterated until convergence.
Solution #7.34
Step 1: Clarify Details of Discover Weekly
First we can ask some clarifying questions:
What is the goal of the algorithm?
Do we recommend just songs, or do we also include podcasts?
Is our goal to recommend new music to a user, and push their musical boundaries? Or is it to just
give them the music they'll want to listen to the most, so they spend more time on Spotify? Said
more generally, how do we think about the trade-off of exploration versus exploitation?
What are the various service-level agreements to consider (e.g., does this playlist need to change
week to week if the user doesn't listen to it?)
Do new users get a Discover Weekly playlist?
Step 2: Describe What Data Features You'd Use
The core features will be user-song interactions. This is because users' behaviors and reactions to
various songs should be the strongest signal for whether or not they enjoy a song. This approach is
similar to the well-known use case for movie recommendations on Netflix, with several notable
differences:
Feedback for music does not have a 1-to-5 rating scale as Netflix does for its movies.
Music may be subject to repeated consumption (i.e., people may watch a movie once or twice
but will listen to a song many times).
Music has a wider variety (i.e., niche music).
The scale of music catalog items is much larger than movies (i.e., there are many more songs
than movies).
There is also a variety of other features outside of user-song interactions that could be interesting to
consider. For example, we have plenty of metadata about the song (the artist, the album, the
playlists that include that song) that could be factored in. Additionally, potential audio features in the
songs themselves (tempo, speechiness, instruments used) could be used. And finally, demographic
information (age, gender, location, etc.) can also impact music listening preferences — people living
in the same region are more likely to have similar tastes than someone living on the other side of the
globe.
Collaborative filtering uses data on the feedback users have provided on certain items (in this case,
songs) in order to decide recommendations. Therefore, a user-song matrix (or a user-artist matrix)
would constitute the dataset at hand, with the rows of the dataset being users and the columns
various songs. However, as discussed in the prior section, since explicit song ratings are lacking, we
can proxy liking a song by using the number of times a user streamed it. This song play count is
stored for every entry in the user-song matrix.
The output of collaborative filtering is a latent user matrix and a song matrix. Using vectors from
these matrices, a dot product denotes the relevance of a particular song to a particular user. This
process can be used directly for recommendation at the user level, where we sort by relevance
scores on songs that the user has not yet streamed. You can also use these vectors to assess
similarity between different users and different songs using a method such as kNN (K-nearest
neighbors).
Solution #7.35
We are attempting to solve for Var(β̂).
Recall that the parameter estimates have the following closed-form solution in matrix form:
β̂ = (XᵀX)⁻¹Xᵀy
To derive the variance of the estimates, recall that for any random variable X:
Var(X) = E[X²] − (E[X])²
Therefore, reading the squares as outer products in the vector case, we have the following:
Var(β̂) = E[β̂β̂ᵀ] − E[β̂]E[β̂]ᵀ
We can evaluate the second term since we can assume the parameter estimates are unbiased;
therefore, E[β̂] = β and
Var(β̂) = E[β̂β̂ᵀ] − ββᵀ
Since least squares assumes that y = Xβ + ε where ε ~ N(0, σ²I), substituting into the closed-form
solution yields:
β̂ = (XᵀX)⁻¹Xᵀ(Xβ + ε) = β + (XᵀX)⁻¹Xᵀε
where we used the fact that (XᵀX)⁻¹XᵀX = I. Therefore:
Var(β̂) = E[(β + (XᵀX)⁻¹Xᵀε)(β + (XᵀX)⁻¹Xᵀε)ᵀ] − ββᵀ
The cross terms vanish since the expectation of the error term is 0, and the ββᵀ terms cancel, leaving:
Var(β̂) = E[(XᵀX)⁻¹Xᵀ ε εᵀ X(XᵀX)⁻¹] = (XᵀX)⁻¹Xᵀ E[εεᵀ] X(XᵀX)⁻¹ = (XᵀX)⁻¹Xᵀ(σ²I)X(XᵀX)⁻¹ = σ²(XᵀX)⁻¹
Upon hearing the term "data scientist," buzzwords such as predictive analytics, big data,
and deep learning may leap to mind. So, let's not beat around the bush: data wrangling isn't
the most fun or sexy part of being a data scientist. However, as a data scientist, you will
likely spend a great deal of your time writing SQL queries to retrieve and analyze
data. As such, almost every company you interview with will test your ability to write SQL
queries. These questions are practically guaranteed if you are interviewing for a data
scientist role on a product or analytics team, or if you're after a data science-adjacent role
like data analyst or business intelligence analyst. Sometimes, data science interviews may go
beyond just writing SQL queries, and cover the basic principles of database design and other
big data systems. This focus on data architecture is particularly true at early-stage startups,
where data scientists often take an active role in data engineering and data infrastructure
development.
SQL
How SQL Interview Questions Are Asked
Because most analytics workflows require quick slicing and dicing of data in SQL, interviewers will
often present you with hypothetical database tables and a business problem, and then ask you to
write a query that answers it.
For example, at a company like Facebook, you might be given a table on user analytics and asked to
calculate the month-to-month retention. Here, it's relatively straightforward what the query should
be, and you're expected to write it. Some companies might make their SQL interview problems
more open-ended. For example, Amazon might give you tables about products and purchases and
then ask you to list the most popular products in each category. Robinhood may give you a table and
ask why users are churning. Here, the tricky part might not be just writing the SQL query, but also
figuring out collaboratively with the interviewer what "popular products" or "user churn" means in
the first place.
Finally, some companies might ask you about the performance of your SQL query. While these
interview questions are rare, and they don't expect you to be a query optimization expert, knowing
how to structure a database for performance, and avoid slow running queries, can be helpful. This
knowledge can come in handy as well when you are asked more conceptual questions about
database design and SQL.
Joins
Imagine you worked at Reddit, and had two separate tables: users and posts.
INNER JOIN
Inner joins combine multiple tables and preserve only the rows where the joined column values match
in the tables being combined. The word INNER is optional and is rarely used because it's the default
type of join. As an example, we use an inner join to find the number of Reddit users who have made a
post:

SELECT COUNT(DISTINCT u.user_id)
FROM users u
JOIN posts p
  ON u.user_id = p.user_id;
A self join is a special case of an inner join where a table joins itself. The most common use case for a
self join is to look at pairs of rows within the same table.
LEFT JOIN
Left joins combine multiple tables by matching on the column names provided, while preserving all
the rows from the first table of the join. As an example, we use a left join to find the percentage of
users that made a post:
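One way this query might look, as a sketch assuming (as in the inner join example) that both tables carry a user_id column:

-- Left join keeps every user; users with no posts contribute NULL post rows
SELECT
  100.0 * COUNT(DISTINCT p.user_id) / COUNT(DISTINCT u.user_id) AS pct_users_with_post
FROM users u
LEFT JOIN posts p
  ON u.user_id = p.user_id;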
RIGHT JOIN
Right joins combine multiple tables by matching on the column names provided, while preserving all
the rows from the second table of the join. For example, we use the right join to find the percentage
of posts made where the user is located in the U.S.:
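A sketch of that query, assuming the users table has a country column and posts has a post_id column (both assumed column names):

-- Right join keeps every post, even if its author is missing from the users table
SELECT
  100.0 * COUNT(CASE WHEN u.country = 'US' THEN p.post_id END) / COUNT(p.post_id) AS pct_us_posts
FROM users u
RIGHT JOIN posts p
  ON u.user_id = p.user_id;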
Join Performance
Joins are an expensive operation to process, and are often bottlenecks in query runtimes. As such, to
write efficient SQL, you want to be working with the fewest rows and columns before joining two
tables together. Some general tips to improve join performance include the following:
Select specific fields instead of using SELECT *
Use LIMIT in your queries
Filter and aggregate data before joining
Avoid multiple joins in a single query
Filtering
SQL contains various ways to compare rows, the most common of which use = and <> (not equal), >,
and <, along with regex and other types of logical and filtering clauses such as OR and AND. For
example, below we filter to active Reddit users from outside the U.S.:
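A sketch of that filter, assuming the users table has is_active and country columns (assumed names):

SELECT *
FROM users
WHERE is_active = TRUE   -- keep only active users
  AND country <> 'US';   -- and exclude the U.S.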
Subqueries serve a similar function to CTEs, but are inline in the query itself and must have a unique
alias for the given scope.
CTEs and subqueries are mostly similar, with the exception that CTEs can be used recursively. Both
concepts are incredibly important to know and practice, since most of the harder SQL interview
questions essentially boil down to breaking the problem into smaller chunks of CTEs and
subqueries.
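As an illustrative sketch of the difference, the same per-user post count can be written either with a CTE or with an inline subquery (column names as in the earlier Reddit examples):

-- CTE version
WITH post_counts AS (
  SELECT user_id, COUNT(*) AS num_posts
  FROM posts
  GROUP BY user_id
)
SELECT u.user_id, pc.num_posts
FROM users u
JOIN post_counts pc
  ON u.user_id = pc.user_id;

-- Subquery version: the same logic, inlined with a unique alias
SELECT u.user_id, pc.num_posts
FROM users u
JOIN (
  SELECT user_id, COUNT(*) AS num_posts
  FROM posts
  GROUP BY user_id
) pc
  ON u.user_id = pc.user_id;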
Partition Specification: separates rows into different partitions, analogous to how GROUP BY
operates. This specification is denoted by the clause PARTITION BY
Ordering Specification: determines the order in which rows are processed, given by the clause
ORDER BY
Window Frame Size Specification: determines which sliding window of rows should be
processed for any given row. The window frame defaults to all rows within a partition but can be
specified by the clause ROWS BETWEEN (start, end)
For instance, below we use a window function to sum up the total Reddit posts per user, and then
add each post count to each row of the users table:
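A sketch of that window function query (post_id is an assumed column name):

SELECT
  u.user_id,
  -- window function: count posts within each user's partition,
  -- repeated on every row for that user
  COUNT(p.post_id) OVER (PARTITION BY u.user_id) AS posts_per_user
FROM users u
LEFT JOIN posts p
  ON u.user_id = p.user_id;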
Note that a comparable version without using window functions looks like the following:
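One equivalent formulation without a window function, using an aggregated subquery joined back to users:

SELECT
  u.user_id,
  COALESCE(pc.num_posts, 0) AS posts_per_user   -- users with no posts get 0
FROM users u
LEFT JOIN (
  SELECT user_id, COUNT(*) AS num_posts
  FROM posts
  GROUP BY user_id
) pc
  ON u.user_id = pc.user_id;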
As you can see, window functions tend to lead to simpler and more expressive SQL.
RANK
Say that for each user, we wanted to rank posts by their length. We can use the window function
RANK() to rank the posts by length for each user:
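A sketch of that ranking, assuming posts has a post_id and a text column (here called post_text, an assumed name):

SELECT
  user_id,
  post_id,
  -- rank each user's posts from longest to shortest
  RANK() OVER (PARTITION BY user_id ORDER BY LENGTH(post_text) DESC) AS length_rank
FROM posts;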
managing data pipelines. Besides generic questions about scaling up data infrastructure, you might
be asked conceptual questions about popular large-scale processing frameworks (Hadoop, Spark) or
orchestration frameworks (Airflow, Luigi) — especially if you happen to list these technologies on
your resume.
Keys allow us to split data efficiently into separate tables, but still enforce a logical relationship
between two tables, rather than having everything duplicated into one table. This process of
generally separating out data to prevent redundancy is called normalization. Along with reducing
redundancy, normalization helps you enforce database constraints and dependencies, which
improves data integrity.
The disadvantage to normalization is that now we need an expensive join operation between the
two related tables. As such, in high-performance systems, denormalization is an optimization
technique where we keep redundant data to prevent expensive join operations. This speeds up read
times, but at the cost of having to duplicate data. At scale, this can be acceptable since storage is
cheap, but compute is expensive.
When normalization comes up in interviews, it often concerns the conceptual setup of database
tables: why a certain entity should have a foreign key to another entity, what the mapping
relationship is between two types of records (one-to-one, one-to-many, or many-to-many), and
when it might be advantageous to denormalize a database.
Although the CAP theorem is a theoretical framework, one should consider the real-life trade-offs
that need to be made based on the needs of the business and those of the database's users. For
example, the Instagram feed focuses on availability and less so on consistency, since what matters is
that you get a result instantly when visiting the feed. The penalty for inconsistent results isn't high.
It's not going to crush users to see @ChampagnePapi's last post has 57,486 likes (instead of the
correct 57,598 likes). In contrast, when designing the service to handle payments on Whatsapp,
you'd favor consistency over availability, because you'd want all servers to have a consistent view of
how much money a user has to prevent people from sending money they didn't have. The downside
is that sometimes sending money takes a minute or a payment fails and you are asked to re-try. Both
are reasonable trade-offs in order to prevent double-spend issues.
The second principle for measuring the correctness and completeness of a database transaction is
called the ACID framework. ACID is an acronym derived from the following desirable characteristics:
Atomicity: an entire transaction occurs as a whole or it does not occur at all (i.e., no partial
transactions are allowed). If a transaction aborts before completing, the database does a
"rollback" on all such incomplete transactions. This prevents partial updates to a database, which
cause data integrity issues.
Consistency: integrity constraints ensure that the database is consistent before and after a given
transaction is completed. Appropriate checks handle any referential integrity for primary and
foreign keys.
Isolation: transactions occur in isolation so that multiple transactions can occur independently
without interference. This characteristic properly maintains concurrency.
Durability: once a transaction is completed, the database is properly updated with the data
associated with that transaction, so that even a system failure could not remove that data from
it.
The ACID properties are particularly important for online transactional processing (OLTP) systems,
where databases handle large volumes of transactions conducted by many users in real time.
Scaling Databases
Traditionally, database scaling was done by using full-copy clusters where multiple database servers
(each referred to as a node within the cluster) contained a full copy of the data, and a load balancer
would roundrobin incoming requests. Since each database server had a full copy of the data, each
node experienced the issues mentioned in the CAP theorem discussed above (especially during high-
load periods). With the advent of the cloud, the approach towards scaling databases has evolved
rapidly.
Nowadays, the cloud makes two main strategies to scaling feasible: vertical and horizontal scaling.
Vertical scaling, also known as scaling up, involves adding CPU and RAM to existing machines. This
approach is easy to administer and does not require changing the way the system is architected.
However, vertical scaling can quickly become prohibitively expensive and eventually limits the scope
for upgrades, because at some point a machine is close to its physical limits and it becomes
practically impossible to replace it with a more performant server.
In horizontal scaling, also known as scaling out, more commodity machines (nodes) are added to the
resource pool. In comparison to vertical scaling, horizontal scaling has a much cheaper cost structure
and has better fault tolerance than vertical scaling. However, as expected, there are trade-offs with
this approach. With many more nodes, you have to deal with issues that arise in any distributed
system, like handling data consistency between nodes. Therefore, horizontal scaling offers a greater
set of challenges in infrastructure management compared to vertical scaling.
Sharding, in which database rows themselves are split across nodes in a cluster, is a common
example of horizontal scaling. For all tables, each node has the same schema and columns as the
original table, but the data are stored independently of other shards. To split the rows of data, a
sharding mechanism determines which node (shard) that data for a given key should exist on. This
sharding mechanism can be a hash function, a range, or a lookup table. The same operations apply
for reading data as well, and so, in this way, each row of data is uniquely mapped to one particular
shard.
Graph
Another type of NoSQL database is the graph database. Neo4j is a well-known graph database,
which stores each data record along with direct pointers to all the other data records it is connected
to.
By making the relationships between the data as important as storing the data itself, graph
databases allow for a more natural representation of nodes and edges, when compared to relational
databases.
MapReduce
MapReduce is a popular data processing framework that allows for the concurrent processing of
large volumes of data. MapReduce involves four main steps:
1) Split step: splits up the input data and distributes it across different nodes
2) Map step: takes the input data and outputs <key, value> pairs
3) Shuffle step: moves all the <key, value> pairs with the same key to the same node
4) Reduce step: processes the <key, value> pairs and aggregates them into a final output
The secret sauce behind MapReduce’s efficiency is the shuffle step; by grouping related data onto
the same node, we can take advantage of the locality of data. Said another way, by shuffling the
related <key, value> pairs needed by the reduce step to the same node rather than sending them to
a different node for reducing, we minimize node-to-node communication, which is often the
bottleneck for distributed computing.
For a concrete example of how MapReduce works, assume you want to count the frequency of
words in a multi-petabyte corpus of text data. Here's how each MapReduce step operates in this case:
1. Split step: We split the large corpus of text into smaller chunks and distribute the pieces to
different machines.
2. Map step: Each worker node applies a specific mapping function to the input data and writes the
output <key, value> pairs to a memory buffer. In this case, our mapping function simply converts
each word into a tuple of the word and its frequency (which is always 1). For example, say we
had the phrase "hello world" on a single machine. The map step would convert that input into
two key-value pairs: <"hello", 1> and <"world", 1>. We do this for the entire corpus, so that if our
corpus is N words big, we end up with N key-value pairs after the map step.
3. Shuffle step: Data is redistributed based on the output keys from the prior step's map function,
such that tuples with the same key are located on the same worker node. In this case, it means
that all tuples of <”hello”,1> will be located on the same worker node, as will all tuples of
<”world”,1>, and so on.
4. Reduce step: Each worker node processes each key in parallel using a specified reducer
operation to obtain the required output result. In this case, we just sum up the tuple counts for
each key, so if there are 5 tuples for <”hello”,1> then the final output will be <"hello", 5>,
meaning that the word "hello" occurred 5 times.
Because the shuffle step moved all the "hello" key-value pairs to the same node, the reducer can
operate locally and, hence, efficiently. The reducer doesn't need to communicate with other nodes
to ask for their "hello" key-value pairs, which minimizes the amount of precious node-to-node
bandwidth consumed.
In practice, since MapReduce is just the processing technique, people rely on Hadoop to manage
the steps of the MapReduce algorithm. Hadoop involves:
1) Hadoop File System (HDFS): manages data storage, backup, and replication
2) MapReduce: as discussed above
3) YARN: a resource manager which manages job scheduling and worker node
orchestration
Spark is another popular open-source tool that provides batch processing similar to Hadoop, with a
focus on speed and reduced disk operations. Unlike Hadoop, Spark uses RAM for computations,
enabling faster in-memory performance but at higher running costs. Additionally, unlike Hadoop
MapReduce, Spark has built-in resource scheduling and monitoring, whereas MapReduce relies on
external resource managers like YARN.
8.2. Robinhood: Assume you are given the tables below containing information on trades and users.
Write a query to list the top three cities that had the most number of completed orders.
8.3. New York Times: Assume that you are given the table below containing information on viewership by
device type (where the three types are laptop, tablet, and phone). Define "mobile" as the sum
of tablet and phone viewership numbers. Write a query to compare the viewership on laptops
versus mobile devices.
viewership
column_name    type
user_id        integer
device_type    string
view_time      datetime
8.4. Amazon: Assume you are given the table below for spending activity by product type. Write a
query to calculate the cumulative spend so far by date for each product over time in
chronological order.
total_trans
column_name    type
order_id       integer
user_id        integer
product_id     string
spend          float
trans_date     datetime
8.5. eBay: Assume that you are given the table below containing information on various orders
made by customers. Write a query to obtain the names of the ten customers who have ordered
the highest number of products among those customers who have spent at least $ 1000 total.
user_transactions
column_name     type
transaction_id  integer
product_id      integer
user_id         integer
spend           float
trans_date      datetime
8.6. Twitter: Assume you are given the table below containing information on tweets. Write a query
to obtain a histogram of tweets posted per user in 2020.
tweets
column_name    type
tweet_id       integer
user_id        integer
msg            string
tweet_date     datetime
8.7. Stitch Fix: Assume you are given the table below containing information on user purchases.
Write a query to obtain the number of people who purchased one or more of the same
product on multiple days.
purchases
column_name    type
purchase_id    integer
user_id        integer
product_id     integer
quantity       integer
price          float
purchase_time  datetime
8.8. Linkedin: Assume you are given the table below that shows the job postings for all companies
on the platform. Write a query to get the total number of companies that have posted
duplicate job listings (two jobs at the same company with the same title and description).
job_listings
column_name    type
job_id         integer
company_id     integer
title          string
description    string
post_date      datetime
8.9. Etsy: Assume you are given the table below on user transactions. Write a query to obtain the
list of customers whose first transaction was valued at $50 or more.
user_transactions
column_name       type
transaction_id    integer
product_id        integer
user_id           integer
spend             float
transaction_date  datetime
8.10. Twitter: Assume you are given the table below containing information on each user's tweets
over a period of time. Calculate the 7-day rolling average of tweets by each user for every date.
tweets
column_name    type
tweet_id       integer
msg            string
user_id        integer
tweet_date     datetime
8.11. Uber: Assume you are given the table below on transactions made by users. Write a query to
obtain the third transaction of every user.
transactions
column_name       type
user_id           integer
spend             float
transaction_date  datetime
8.12. Amazon: Assume you are given the table below containing information on customer spend on
products belonging to various categories. Identify the top three highest-grossing items within
each category in 2020.
product_spend
column_name       type
transaction_id    integer
category_id       integer
product_id        integer
user_id           integer
spend             float
transaction_date  datetime
8.13. Walmart: Assume you are given the below table on transactions from users. Bucketing users
based on their latest transaction date, write a query to obtain the number of users who made a
purchase and the total number of products bought for each transaction date.
user_transactions
column_name       type
transaction_id    integer
product_id        integer
user_id           integer
spend             float
transaction_date  datetime
8.14. Facebook: What is a database view? What are some advantages views have over tables?
8.15. Expedia: Say you have a database system where most of the queries made were UPDATEs/
INSERTs/DELETEs. How would this affect your decision to create indices? What if the queries
made were mostly SELECTs and JOINs instead?
8.16. Microsoft: What is a primary key? What characteristics does a good primary key have?
8.17. Amazon: Describe some advantages and disadvantages of relational databases vs. NoSQL
databases.
8.18. Capital One: Say you want to set up a MapReduce job to implement a shuffle operator, whose
input is a dataset and whose output is a randomly ordered version of that same dataset. At a
high level, describe the steps in the shuffle operator's algorithm.
8.19. Amazon: Name one major similarity and difference between a WHERE clause and a HAVING
clause in SQL.
8.20. KPMG: Describe what a foreign key is and how it relates to a primary key.
8.21. Microsoft: Describe what a clustered index and a non-clustered index are. Compare and
contrast the two.
Medium Problems
8.22. Twitter: Assume you are given the two tables below containing information on the topics that
each Twitter user follows and the ranks of each of these topics. Write a query to obtain all
existing users on 2021-01-01 that did not follow any topic in the 100 most popular topics for
that day.
8.23. Facebook: Assume you have the tables below containing information on user actions. Write a
query to obtain active user retention by month. An active user is defined as someone who took
an action (sign-in, like, or comment) in the current month.
user_actions
column_name    type
user_id        integer
event_id       string ("sign-in", "like", "comment")
timestamp      datetime
8.24. Twitter: Assume you are given the tables below containing information on user session activity.
Write a query that ranks users according to their total session durations for each session type
between the start date (2021-01-01 ) and the end date (2021-02-01).
sessions
column_name    type
session_id     integer
user_id        integer
session_type   string
duration       integer
start_time     datetime
8.25. Snapchat: Assume you are given the tables below containing information on Snapchat users
and their time spent sending and opening snaps. Write a query to obtain a breakdown of the
time spent sending vs. opening snaps (as a percentage of total time spent) for each of the
different age groups.
activities
column_name    type
activity_id    integer

age_breakdown
column_name    type
user_id        integer
8.26. Pinterest: Assume you are given the table below containing information on user sessions,
including their start and end times. A session is considered to be concurrent with another user's
session if they overlap. Write a query to obtain the user session that is concurrent with the
largest number of other user sessions.
sessions
column_name    type
session_id     integer
start_time     datetime
end_time       datetime
8.27. Yelp: Assume you are given the table below containing information on user reviews. Define a
top-rated business as one whose reviews contain only 4 or 5 stars. Write a query to obtain the
number and percentage of businesses that are top rated.
reviews
column_name    type
business_id    integer
user_id        integer
review_text    string
review_stars   integer
review_date    datetime
8.28. Google: Assume you are given the table below containing measurement values obtained from
a sensor over several days. Measurements are taken several times within a given day. Write a
query to obtain the sum of the odd-numbered measurements and the sum of the even-
numbered measurements by date.
measurements
column_name        type
measurement_id     integer
measurement_value  float
measurement_time   datetime
8.29. Etsy: Assume you are given the two tables below containing information on user signups and
user purchases. Of the users who joined within the past week, write a query to obtain the
percentage of users that also purchased at least one item.
8.30. Walmart: Assume you are given the following tables on customer transactions and products.
Find the top 10 products that are most frequently bought together (purchased in the same
transaction).
transactions
column_name       type
transaction_id    integer
product_id        integer
user_id           integer
quantity          integer
transaction_time  datetime

products
column_name    type
product_id     integer
product_name   string
price          float
8.31. Facebook: Assume you have the table given below containing information on user logins. Write
a query to obtain the number of reactivated users (i.e., those who didn't log in the previous
month, who then logged in during the current month).
user_logins
column_name    type
user_id        integer
login_date     datetime
8.32. Wayfair: Assume you are given the table below containing information on user transactions for
particular products. Write a query to obtain the year-on-year growth rate for the total spend of
each product, for each week (assume there is data each week).
user_transactions
column_name       type
transaction_id    integer
product_id        integer
user_id           integer
spend             float
transaction_date  datetime
8.34. Facebook: Say you had the entire Facebook social graph (users and their friendships). How
would you use MapReduce to find the number of mutual friends for every pair of Facebook
users?
8.35. Google: Assume you are tasked with designing a large-scale system that tracks a variety of
search query strings and their frequencies. How would you design this, and what trade-offs
would you need to consider?
Solution #8.1
To get the click-through rate, we use the following query, which includes a SUM along with an IF to
obtain the total number of clicks and impressions, respectively. Lastly, we filter on the timestamp to
obtain the click-through rate for just the year 2019.
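A sketch of such a query, assuming MySQL-style IF() and an events table with app_id, event_type ('click' or 'impression'), and timestamp columns (the table and column names here are assumptions):

SELECT
  app_id,
  -- clicks divided by impressions, as a percentage
  100.0 * SUM(IF(event_type = 'click', 1, 0))
        / SUM(IF(event_type = 'impression', 1, 0)) AS ctr
FROM events
WHERE timestamp >= '2019-01-01'
  AND timestamp < '2020-01-01'
GROUP BY app_id;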
Solution #8.2
"To find the cities with the top three highest number of completed orders, we first write an inner
query to join the trades and user table based on the user_id column and then filter for complete
orders. Using COUNT DISTINCT, we obtain the number of orders per city. With that result, we then
perform a simple GROUP BY on city and order by the resulting number of orders, as shown below:
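A sketch of that query, assuming trades(order_id, user_id, status) and users(user_id, city) — the column names are assumptions:

SELECT
  u.city,
  COUNT(DISTINCT t.order_id) AS num_completed_orders
FROM trades t
JOIN users u
  ON t.user_id = u.user_id
WHERE t.status = 'complete'        -- keep only completed orders
GROUP BY u.city
ORDER BY num_completed_orders DESC
LIMIT 3;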
Solution #8.3
To compare the viewership on laptops versus mobile devices, we first can use an IF statement to
map each device type according to the specifications. Since the tablet and phone categories form
the "mobile" device type, we can leave laptop as its own device type (i.e., "laptop"). We can then
simply SUM the counts for each device type:
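One possible version of the query on the viewership table above (MySQL-style IF):

SELECT
  SUM(IF(device_type = 'laptop', 1, 0))             AS laptop_views,
  SUM(IF(device_type IN ('tablet', 'phone'), 1, 0)) AS mobile_views
FROM viewership;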
Solution #8.4
Since we don't care about the particular order_id or user_id, we can use a window function to
partition by product and order by transaction date. Spending is then summed over every date and
product as follows:
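A sketch of that window function query over the total_trans table:

SELECT
  trans_date,
  product_id,
  -- running total of spend within each product, ordered by date
  SUM(spend) OVER (
    PARTITION BY product_id
    ORDER BY trans_date
  ) AS cumulative_spend
FROM total_trans
ORDER BY product_id, trans_date;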
Solution #8.5
In order to obtain a count of products by user, we employ COUNT on product_id for each user; hence,
the GROUP BY is performed over user_id. To filter on having spent at least $1,000, we use a HAVING
SUM(spend) >= 1000 clause. Lastly, we order user_ids by product_id count and take the top 10.
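Following that approach on the user_transactions table:

SELECT
  user_id,
  COUNT(product_id) AS num_products
FROM user_transactions
GROUP BY user_id
HAVING SUM(spend) >= 1000    -- customers who spent at least $1,000 total
ORDER BY num_products DESC
LIMIT 10;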
Solution #8.6
First, we obtain the number of tweets per user in 2020 by using a simple COUNT within an initial
subquery. Then, we use that tweet column as the bucket within a new GROUP BY and COUNT as
shown below:
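A sketch of that two-step query on the tweets table:

WITH tweets_per_user AS (
  SELECT
    user_id,
    COUNT(tweet_id) AS tweet_count
  FROM tweets
  WHERE tweet_date >= '2020-01-01'
    AND tweet_date < '2021-01-01'
  GROUP BY user_id
)
SELECT
  tweet_count AS tweet_bucket,      -- histogram bucket: tweets posted in 2020
  COUNT(user_id) AS num_users
FROM tweets_per_user
GROUP BY tweet_count
ORDER BY tweet_bucket;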
Solution #8.7
We can't simply perform a count, since by definition the purchases must have been made on
different days (and for the same product). To address this, we use a ranking window function
partitioned by user_id and product_id and ordered by purchase date to determine each purchase's
number. From this inner subquery, we then obtain the count of user ids for which the purchase
number reaches 2 (note that we don't need to look above 2, since any user with a purchase number
above 2 also has a row with purchase number 2).
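A sketch of that idea, using DENSE_RANK over the distinct purchase dates (a small variation on the ranking approach described above, so that several same-day purchases don't inflate the count; DATE() is MySQL-style date truncation):

WITH ranked AS (
  SELECT
    user_id,
    product_id,
    -- rank the distinct days on which each (user, product) pair was bought
    DENSE_RANK() OVER (
      PARTITION BY user_id, product_id
      ORDER BY DATE(purchase_time)
    ) AS day_rank
  FROM purchases
)
SELECT COUNT(DISTINCT user_id) AS num_users
FROM ranked
WHERE day_rank = 2;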
Solution #8.8
To find all companies with duplicate listings based on title and description, we can use a ROW_NUMBER()
(or RANK()) window function partitioning on company_id, title, and description. Then, we can filter for
listings where the row number based on those partition fields is greater than 1, which indicates
duplicated jobs, and then take a distinct count of the companies:
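A sketch of that query on the job_listings table:

WITH numbered AS (
  SELECT
    company_id,
    -- number identical (company, title, description) listings
    ROW_NUMBER() OVER (
      PARTITION BY company_id, title, description
      ORDER BY post_date
    ) AS rn
  FROM job_listings
)
SELECT COUNT(DISTINCT company_id) AS companies_with_duplicates
FROM numbered
WHERE rn > 1;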
Solution #8.9
Although we could use a self join on transaction_date = MIN(transaction_date) for each user, we
can also use the ROW_NUMBER window function to get the ordering of customer purchases. We
can then use that subquery to filter on customers whose first purchase (row number one) was
valued at 50 dollars or more. Note that this requires the subquery to include spend as well:
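A sketch of that approach on the user_transactions table:

WITH ordered AS (
  SELECT
    user_id,
    spend,
    ROW_NUMBER() OVER (
      PARTITION BY user_id
      ORDER BY transaction_date
    ) AS txn_number
  FROM user_transactions
)
SELECT user_id
FROM ordered
WHERE txn_number = 1       -- each user's first transaction
  AND spend >= 50;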
Solution #8.10
First, we need to obtain the total number of tweets made by each user on each day, which can be
computed in a CTE using GROUP BY on user_id and tweet_date, along with a COUNT DISTINCT on
tweet_id. Then, we use a window function on the resulting subquery to take an AVG of the number of
tweets over the six prior rows and the current row (thus giving us the 7-day rolling average), while
partitioning by user_id and ordering by tweet_date:
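A sketch of that rolling average, treating tweet_date as a daily date:

WITH daily AS (
  SELECT
    user_id,
    tweet_date,
    COUNT(DISTINCT tweet_id) AS num_tweets
  FROM tweets
  GROUP BY user_id, tweet_date
)
SELECT
  user_id,
  tweet_date,
  -- average over the current day and the six prior days for the same user
  AVG(num_tweets) OVER (
    PARTITION BY user_id
    ORDER BY tweet_date
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS rolling_avg_7d
FROM daily;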
Solution #8.11
First, we obtain the transaction number for each user. We can do this by using the ROW_NUMBER
window function, where we PARTITION by user_id and ORDER by transaction_date, calling the
resulting field the transaction number. From there, we can simply take all transactions
having a transaction number equal to 3.
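A sketch of that query on the transactions table:

WITH numbered AS (
  SELECT
    user_id,
    spend,
    transaction_date,
    ROW_NUMBER() OVER (
      PARTITION BY user_id
      ORDER BY transaction_date
    ) AS txn_number
  FROM transactions
)
SELECT user_id, spend, transaction_date
FROM numbered
WHERE txn_number = 3;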
Solution #8.12
First, we calculate a subquery with total spend by product and category using SUM and GROUP BY.
Note that we must filter by a 2020 transaction date. Then, using this subquery, we utilize a window
function to calculate the rankings (by spend) for each product category using the RANK window
function over the existing sums in the previous subquery. For the window function, we PARTITION by
category and ORDER by product spend. Finally, we use this result and then filter for a rank less than
or equal to 3 as shown below.
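A sketch of that two-step query on the product_spend table:

WITH category_spend AS (
  SELECT
    category_id,
    product_id,
    SUM(spend) AS total_spend
  FROM product_spend
  WHERE transaction_date >= '2020-01-01'
    AND transaction_date < '2021-01-01'
  GROUP BY category_id, product_id
)
SELECT category_id, product_id, total_spend
FROM (
  SELECT
    category_id,
    product_id,
    total_spend,
    RANK() OVER (
      PARTITION BY category_id
      ORDER BY total_spend DESC
    ) AS spend_rank
  FROM category_spend
) ranked
WHERE spend_rank <= 3;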
Solution #8.13
First, we obtain the latest transaction date for each user. This can be done in a CTE using the RANK
window function to get rankings of products purchased per user based on the purchase transaction
date. Then, using this CTE, we simply COUNT both the user ids and product ids where the latest rank
is 1 while grouping by each transaction date.
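A sketch of that query on the user_transactions table:

WITH ranked AS (
  SELECT
    user_id,
    product_id,
    transaction_date,
    RANK() OVER (
      PARTITION BY user_id
      ORDER BY transaction_date DESC
    ) AS txn_rank
  FROM user_transactions
)
SELECT
  transaction_date,
  COUNT(DISTINCT user_id) AS num_users,
  COUNT(product_id)       AS num_products
FROM ranked
WHERE txn_rank = 1          -- each user's latest transaction date
GROUP BY transaction_date;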
Solution #8.14
A database view is the result of a particular query within a set of tables. Unlike a normal table, a
view does not have a physical schema. Instead, a view is computed dynamically whenever it is
requested. If the underlying tables that the views reference are changed, then the views will change
accordingly. Views have several advantages over tables:
1. Views can simplify workflows by aggregating multiple tables, thus abstracting the complexity of
underlying data or operations.
2. Since views can represent only a subset of the data, they provide limited exposure of the table's
underlying data and hence increase data security.
3. Since views do not store actual data, there is significantly less memory overhead.
Solution #8.15
SQL statements that modify the database, like UPDATE, INSERT, and DELETE, need to change not only
the rows of the table but also the underlying indexes. Therefore, the performance of those
statements depends on the number of indexes that need to be updated. The larger the number of
indexes, the longer it takes those statements to execute. On the flip side, for read queries, indexing
can dramatically speed up row retrieval, since lookups can use the index rather than scanning the full
table. This matters most for statements that would otherwise perform full table scans, like SELECTs
and JOINs.
Therefore, for databases used in online transaction processing (OLTP) workloads, where database
updates and inserts are common, indexes generally lead to slower performance. In situations where
databases are used for online analytical processing (OLAP), where database modifications are
infrequent but searching and joining the data is common, indexes generally lead to faster
performance.
Solution #8.16
A primary key uniquely identifies an entity. It can consist of multiple columns (known as a composite
key) and cannot be NULL.
Characteristics of a good primary key are:
Stability: a primary key should not change over time.
Uniqueness: having duplicate (non-unique) values for the primary key defeats the purpose of the
primary key.
Irreducibility: no subset of columns in a primary key is itself a primary key. Said another way,
removing any column from a good primary key means that the key's uniqueness property would
be violated.
Solution #8.17
Advantages of Relational Databases: Ensure data integrity through a defined schema and the ACID
properties. Easy to get started with and use for small-scale applications. Lends itself well to vertical
scaling. Uses an almost standard query language, making learning or switching between different
types of relational databases easy.
Advantages of NoSQL Databases: Offers more flexibility in data format and representations, which
makes working with unstructured or semistructured data easier. Hence, useful when still iterating on
the data schema or adding new features/functionality rapidly like in a startup environment.
Convenient to scale with horizontal scaling. Lends itself better to applications that need to be highly
available.
Solution #8.18
At a high level, to shuffle the data randomly, we need to map each row of the input data to a random
key. This ensures that the row of input data is randomly sent to a reducer, where it's simply
outputted. More concretely, the steps of the MapReduce algorithm are:
1. Map step: Each row is assigned a random key from 1, ..., k, where k is the number of reducer
nodes available. Therefore, for every row, the output is the tuple (random key, row).
2. Shuffle step: Rows with the same key go to the same reducer.
3. Reduce step: For each record, the row is simply outputted.
Since each reducer only receives the rows that were randomly assigned to its key i, where i is from
1, ..., k, the resulting output will be ordered randomly.
Solution #8.19
A couple of answers are possible, but here are some examples:
Similarities:
1. Both clauses are used to limit/filter a given query's results.
2. Both clauses are optional within a query.
3. Usually, queries utilizing one of the two can be transformed to use the other.
Differences:
1. A HAVING clause can follow a GROUP BY statement, but WHERE cannot.
2. A WHERE clause evaluates per row, whereas a HAVING clause evaluates per group.
3. Aggregate functions can be referred to in a logical expression if a HAVING clause is used.
Solution #8.20
Foreign keys are a set of attributes that aid in joining tables by referencing primary keys (although
joins can occur without them). Primarily, they exist to ensure data integrity. The table with the
primary key is called the parent table, whereas the table with the foreign key is called the child table.
Since foreign keys create a link between the two tables, having foreign keys ensures that these links
are valid and prevents data from being inserted that would otherwise violate these conditions.
Foreign keys can be created during CREATE commands, and it is possible to DROP or ALTER foreign
keys.
When designating foreign keys, it is important to think about the cardinality — the relationship
between parent and child tables. Cardinality can take on four forms: one-to-one (one row in the
parent table maps to one row in the child table), one-to-many (one row in the parent table maps to
many rows in the child table), many-to-one (many rows in the parent table map to one row in the
child table), and many-to-many (many rows in the parent table map to many rows in the child table).
The particular type of relationship between the parent and child table determines the specific syntax
used when setting up foreign keys.
Solution #8.21
Both clustered indexes and non-clustered indexes help speed up queries in a database. With a
clustered index, database rows are stored physically on the disk in the same exact order as the index.
This arrangement allows you to rapidly retrieve all rows that fall into a range of clustered index
values. However, there can only be one clustered index per table since data can only be sorted
physically on the disk in one particular way at a time.
In contrast, a non-clustered index does not match the physical layout of the rows on the disk on
which the data are stored. Instead, it duplicates data from the indexed column(s) and contains a
pointer to the rest of the data. A non-clustered index is stored separately from the table data, and
hence, unlike a clustered index, multiple non-clustered indexes can exist per table. Therefore, insert
and update operations on a non-clustered index are faster since data on the disk doesn't need to
match the physical layout as in the case of a clustered index. However, this makes the storage
requirement for a non-clustered index higher than for a clustered index. Additionally, lookup
operations on a non-clustered index may be slower than those on a clustered index, since all queries
must go through an additional layer of indirection.
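As a sketch, in SQL Server-style syntax (the index and table names here are hypothetical):

-- Rows of orders are physically stored in order_date order (only one allowed per table)
CREATE CLUSTERED INDEX idx_orders_date ON orders (order_date);

-- A separate structure with pointers back to the rows (many allowed per table)
CREATE NONCLUSTERED INDEX idx_orders_customer ON orders (customer_id);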
Solution #8.22
First, we need to obtain the top 100 most popular topics for the given date by employing a simple
subquery. Then, we need to identify all users who followed no topic included within these top 100
for the date specified. Equivalently, we could identify those that did follow one of these topics and
then filter them out of this list of users that existed on 2021-01-01.
Two approaches are as follows:
1. Use the MINUS (or EXCEPT) operator and subtract those following a top 100 topic (via an inner
join) from the entire user universe.
2. Use a WHERE NOT EXISTS clause in a similar fashion.
For simplicity, the solution below uses the MINUS operator. Note that we need to filter on date in
the user_topics table so that we capture only users who existed as of 2021-01-01:
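A sketch of this approach, assuming hypothetical user_topics(user_id, topic_id, date) and topic_stats(topic_id, date, follower_count) tables (the top-100 limiting syntax varies by dialect):

SELECT user_id
FROM user_topics
WHERE date <= '2021-01-01'
MINUS
SELECT ut.user_id
FROM user_topics ut
INNER JOIN (
    SELECT topic_id
    FROM topic_stats
    WHERE date = '2021-01-01'
    ORDER BY follower_count DESC
    LIMIT 100                      -- top 100 most popular topics for the date
) top_topics ON ut.topic_id = top_topics.topic_id
WHERE ut.date <= '2021-01-01';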
Solution #8.23
In order to calculate user retention, we need to check for each user whether they were active this
month versus last month. To bucket days into each month, we need to obtain the first day of the
month for the specified date by using DATE_TRUNC. We use a COUNT DISTINCT over user_id to obtain
the monthly active user (MAU) count for the month. This can be put into a subquery called
curr_month, and then EXISTS can be used to check it against another subquery for the previous
month, last_month. In that subquery, ADD_MONTHS can be used with an argument of 1 to shift last
month's dates forward to the current month, thereby allowing us to check for user actions from the
previous month (since that would mean they were logged in), as shown below:
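A sketch of the retention query, assuming a hypothetical user_actions(user_id, event_date) table (DATE_TRUNC and ADD_MONTHS availability varies by dialect):

SELECT
    DATE_TRUNC('month', curr_month.event_date) AS month,
    COUNT(DISTINCT curr_month.user_id) AS retained_mau
FROM user_actions curr_month
WHERE EXISTS (
    SELECT 1
    FROM user_actions last_month
    WHERE last_month.user_id = curr_month.user_id
      -- shift last month's activity forward by 1 month to line up with the current month
      AND DATE_TRUNC('month', ADD_MONTHS(last_month.event_date, 1)) =
          DATE_TRUNC('month', curr_month.event_date)
)
GROUP BY DATE_TRUNC('month', curr_month.event_date);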
Solution #8.24
First, we can use a CTE to obtain the total session duration by user and session type between
the start and end dates. Then, we can use RANK to obtain the rank, making sure to partition by
session type and then order by duration as in the query below:
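A sketch, assuming a hypothetical sessions(user_id, session_type, duration, start_date) table and a made-up date range:

WITH user_durations AS (
    SELECT user_id, session_type, SUM(duration) AS total_duration
    FROM sessions
    WHERE start_date BETWEEN '2022-01-01' AND '2022-02-01'   -- assumed start and end dates
    GROUP BY user_id, session_type
)
SELECT
    user_id,
    session_type,
    RANK() OVER (PARTITION BY session_type ORDER BY total_duration DESC) AS duration_rank
FROM user_durations;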
Solution #8.25
We can obtain the total time spent on sending and opening using conditional IF statements for each
activity type while getting the amount of time_spent in a CTE. We can also obtain the total time
spent in the same CTE. Next, we take that result and JOIN by the corresponding user_id with
activities. We filter for just send and open activity types and group by age bucket. Then, using this
CTE, we can calculate the percentages of send and open time spent versus overall time spent as
follows:
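A sketch, assuming hypothetical activities(user_id, activity_type, time_spent) and age_breakdown(user_id, age_bucket) tables (IF is MySQL-style; other dialects use CASE WHEN):

WITH time_by_bucket AS (
    SELECT
        ab.age_bucket,
        SUM(IF(a.activity_type = 'send', a.time_spent, 0)) AS send_time,
        SUM(IF(a.activity_type = 'open', a.time_spent, 0)) AS open_time,
        SUM(a.time_spent) AS total_time
    FROM activities a
    JOIN age_breakdown ab ON a.user_id = ab.user_id
    WHERE a.activity_type IN ('send', 'open')
    GROUP BY ab.age_bucket
)
SELECT
    age_bucket,
    100.0 * send_time / total_time AS send_pct,
    100.0 * open_time / total_time AS open_pct
FROM time_by_bucket;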
Solution #8.26
The first step is to determine the query logic for when two sessions are concurrent. Say we have two
sessions, session 1 and session 2. There are two cases in which they overlap:
1. If session 1 starts first, they overlap when session 2's start time is less than or equal to
session 1's end time.
2. If session 2 starts first, they overlap when session 2's end time is greater than or equal to
session 1's start time.
Because the self join considers every ordered pair of sessions, both cases are covered by checking
whether session 2's start time falls between session 1's start time and session 1's end time.
With this in mind, we can calculate the number of sessions that started during the time another
session was running by using an inner join and using BETWEEN to check the concurrency case as
follows:
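A sketch, assuming a hypothetical sessions(session_id, start_time, end_time) table:

SELECT COUNT(*) AS concurrent_sessions
FROM sessions s1
INNER JOIN sessions s2
    ON s2.session_id != s1.session_id
   AND s2.start_time BETWEEN s1.start_time AND s1.end_time;   -- s2 started while s1 was running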
Solution #8.27
First, we need to identify businesses having reviews consisting of only 4 or 5 stars. We can do so by
using a CTE to find the lowest number of stars given to a business across all its reviews. Then, we can
use a SUM and IF statement to filter for businesses with a minimum review of 4 or 5 stars to get
the total number of top-rated businesses, and then divide this by the total number of businesses to
find the percentage of top-rated businesses, as in the sketch below:
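A sketch, assuming a hypothetical reviews(business_id, review_stars) table:

WITH business_min_stars AS (
    SELECT business_id, MIN(review_stars) AS lowest_stars
    FROM reviews
    GROUP BY business_id
)
SELECT
    100.0 * SUM(IF(lowest_stars >= 4, 1, 0)) / COUNT(*) AS top_rated_pct
FROM business_min_stars;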
Solution #8.28
First, we need to establish which measurements are odd numbered and which are even numbered.
We can do so by using the ROW_NUMBER window function over the measurement_time to obtain
the measurement number within a day. Then, we separate odd from even measurements by checking
whether the measurement number mod 2 is 1 (odd) or 0 (even). Finally, we sum by date using a
conditional IF statement, summing over the corresponding measurement_value:
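A sketch, assuming a hypothetical measurements(measurement_time, measurement_value) table:

WITH numbered AS (
    SELECT
        CAST(measurement_time AS DATE) AS measurement_day,
        measurement_value,
        ROW_NUMBER() OVER (
            PARTITION BY CAST(measurement_time AS DATE)
            ORDER BY measurement_time
        ) AS measurement_num
    FROM measurements
)
SELECT
    measurement_day,
    SUM(IF(MOD(measurement_num, 2) = 1, measurement_value, 0)) AS odd_sum,
    SUM(IF(MOD(measurement_num, 2) = 0, measurement_value, 0)) AS even_sum
FROM numbered
GROUP BY measurement_day;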
Solution #8.29
First, we obtain the latest week's users. To do this, we use NOW for the current time and subtract an
INTERVAL of 7 days, thus providing the relevant user IDs to look at. By using a LEFT JOIN, we have all
signed-up users and whether or not they made a purchase. Now we take the COUNT of DISTINCT
users from the purchase table, divide it by the COUNT of DISTINCT users from the signup table, and
then multiply the results by 100 to obtain a percentage:
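A sketch, assuming hypothetical signups(user_id, signup_date) and purchases(user_id, purchase_date) tables (interval syntax varies by dialect):

SELECT
    100.0 * COUNT(DISTINCT p.user_id) / COUNT(DISTINCT s.user_id) AS purchase_pct
FROM signups s
LEFT JOIN purchases p ON s.user_id = p.user_id
WHERE s.signup_date >= NOW() - INTERVAL 7 DAY;   -- latest week's signups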
Solution #8.30
First, we can join the transactions and products tables on product_id to get the user,
product_name, and transaction time for each transaction. With the CTE at hand, we can do a self join
to fetch products that were purchased together by a single user by joining on transaction_id. Note
that we want all pairs of products, but we don't want to overcount, i.e., if user A purchased products
X and Y in the same transaction, then we only want to count the (X, Y) transaction once, and not also
(Y, X). To handle this, we can use a condition within the inner join that the product_id of A is less than
that of B (where A and B are the CTE results from before). Lastly, we use a GROUP BY clause for each
pair of products and sort by the resulting count, taking the top 10:
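A sketch, assuming hypothetical transactions(transaction_id, user_id, product_id, transaction_time) and products(product_id, product_name) tables:

WITH purchases AS (
    SELECT t.transaction_id, t.user_id, p.product_id, p.product_name
    FROM transactions t
    JOIN products p ON t.product_id = p.product_id
)
SELECT
    a.product_name AS product_1,
    b.product_name AS product_2,
    COUNT(*) AS times_bought_together
FROM purchases a
INNER JOIN purchases b
    ON a.transaction_id = b.transaction_id
   AND a.product_id < b.product_id       -- count (X, Y) once, never also (Y, X)
GROUP BY a.product_name, b.product_name
ORDER BY times_bought_together DESC
LIMIT 10;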
Solution #8.31
First, we look at all users who did not log in during the previous month. To obtain the last month's
data, we subtract an INTERVAL of 1 month from the current month's login date. Then, we use a
WHERE NOT EXISTS check against the previous month's interval to verify that there was no login in the
previous month. Finally, we COUNT the number of users satisfying this condition, as sketched below.
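A sketch of that reading (users active this month with no login in the prior month), assuming a hypothetical logins(user_id, login_date) table:

SELECT COUNT(DISTINCT curr.user_id) AS user_count
FROM logins curr
WHERE DATE_TRUNC('month', curr.login_date) = DATE_TRUNC('month', NOW())
  AND NOT EXISTS (
      SELECT 1
      FROM logins prev
      WHERE prev.user_id = curr.user_id
        -- previous month = current month minus an interval of 1 month
        AND DATE_TRUNC('month', prev.login_date) =
            DATE_TRUNC('month', NOW() - INTERVAL '1 month')
  );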
Solution #8.32
First, we need to obtain the total weekly spend by product using SUM and GROUP BY operations and
use DATE_TRUNC on the transaction date to specify a particular week. Using this information, we
then calculate the prior year's weekly spend for each product. In particular, we want to take a LAG
of 52 weeks and PARTITION BY product to calculate that week's prior-year spend for the given
product. Lastly, we divide the current total spend by the corresponding 52-week lag value:
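A sketch, assuming a hypothetical transactions(product_id, transaction_date, spend) table (this assumes each product has a row for every week, so the 52-week lag lines up with the same week last year):

WITH weekly_spend AS (
    SELECT
        product_id,
        DATE_TRUNC('week', transaction_date) AS week,
        SUM(spend) AS total_spend
    FROM transactions
    GROUP BY product_id, DATE_TRUNC('week', transaction_date)
)
SELECT
    product_id,
    week,
    total_spend /
        LAG(total_spend, 52) OVER (PARTITION BY product_id ORDER BY week)
        AS yoy_spend_ratio                  -- current week vs. the same week last year
FROM weekly_spend;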
Solution #8.33
First, we need to obtain the total daily transactions using a simple SUM and GROUP BY operation.
Having the daily transactions, we then perform a self join on the table using the condition that one
transaction date falls within the 7 days ending at the other, which we can check by using the
DATE_ADD function along with a condition that the joined date does not come after the date being reported on:
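A sketch, assuming a hypothetical transactions(transaction_date, amount) table:

WITH daily AS (
    SELECT transaction_date, SUM(amount) AS daily_total
    FROM transactions
    GROUP BY transaction_date
)
SELECT
    d1.transaction_date,
    SUM(d2.daily_total) AS trailing_7_day_total
FROM daily d1
INNER JOIN daily d2
    ON d2.transaction_date <= d1.transaction_date                     -- never look forward
   AND DATE_ADD(d2.transaction_date, INTERVAL 7 DAY) > d1.transaction_date  -- within 7 days
GROUP BY d1.transaction_date;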
Solution #8.34
To use MapReduce to find the number of mutual friends for all pairs of Facebook users, we can think
about what the end output needs to be and then work backward. Concretely, for all given pairs of
users X and Y, we want to identify which friends they have in common, from which we'll derive the
mutual friend count. The core of this algorithm is finding the intersection between the friends list for
X and the friends list for Y. This operation can be delegated to the reducer. Therefore, it is sensible
that the key for our reduce step should be the tuple (X, Y) and that the value to be reduced is the
combination of the friends list of X and the friends list of Y. Thus, in the map step, we want to output
the tuple (X, Z) for each friend Z that X has.
As an example, assume that X is friends with [W, Y, Z] and Y is friends with [X, Z].
1. Map step: For X, we want to output the following tuples: 1) ((X, W), [W, Y, Z]), 2) ((X, Y), [W, Y, Z]),
and 3) ((X, Z), [W, Y, Z]). For Y, we want to output the following tuples: 1) ((X, Y), [X, Z]), and 2) ((Y,
Z), [X, Z]). Note that each key is sorted, so that (Y, X) becomes (X, Y).
2. Shuffle step: Each machine is delegated data based on the keys from the map step, i.e., each
tuple (X, Y). So, in the previous example, note that the map step outputs the key (X, Y) for both X
and Y, and therefore both of the keys are on the same machine. That machine will therefore have
the tuple (X, Y) as the key, and will store [W, Y, Z] and [X, Z] to be used in the reduce step.
3. Reduce step: We group by keys and take the intersection of the resulting lists. For the example key
(X, Y), with lists [W, Y, Z] and [X, Z], we take the intersection of [W, Y, Z] and [X, Z], which is [Z]. Thus,
we return the length of the set (1) for the input (X, Y).
Therefore, we are able to identify Z as the common friend of X and Y, and can return 1 as the number
of mutual friends. The process outlined above is repeated in parallel for every pair of Facebook users
in order to find the final mutual friend counts between each pair of users.
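A rough single-machine Python simulation of these three steps, using a made-up friendship map:

from collections import defaultdict

def map_step(friend_lists):
    # Emit ((A, B), friends_of_A) for every friend B of A, with the key sorted
    for user, friends in friend_lists.items():
        for friend in friends:
            yield tuple(sorted((user, friend))), set(friends)

def shuffle_step(mapped):
    # Group the emitted friend lists by their (A, B) key
    buckets = defaultdict(list)
    for key, friends in mapped:
        buckets[key].append(friends)
    return buckets

def reduce_step(buckets):
    # Intersect the two friend lists for each pair to count mutual friends
    return {pair: len(set.intersection(*lists))
            for pair, lists in buckets.items() if len(lists) == 2}

friends = {'X': ['W', 'Y', 'Z'], 'Y': ['X', 'Z'], 'Z': ['X', 'Y'], 'W': ['X']}
print(reduce_step(shuffle_step(map_step(friends))))   # e.g., ('X', 'Y') maps to 1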
Solution #8.35
To design a system that tracks search query strings and their frequencies, we can start with a basic
key-value store. For each search query string, we store the corresponding frequency in a database
table containing only those two fields. To build the system at scale, we have two options: vertical
scaling or horizontal scaling. For vertical scaling, we would add more CPU and RAM to the existing
server, while for horizontal scaling, we would spread the key-value pairs across additional machines.
Every Superman has his kryptonite, but as a Data Scientist, coding can't be yours. Between
data munging, pulling in data from APIs, and setting up data processing pipelines, writing
code is a near-universal part of a Data Scientist's job. This is especially true at smaller
companies, where data scientists tend to wear multiple hats and are responsible for
productionizing their analyses and models. Even if you are the rare Data Scientist that never
has to write production code, consider the collaborative nature of the field— having strong
computer science fundamentals will give you a leg up when working with software and data
engineers.
To test your programming foundation, Data Science interviews often take you on a stroll
down memory lane back to your Data Structures and Algorithms class (you did take one,
right?). These coding questions test your ability to manipulate data structures like lists, trees,
and graphs, along with your ability to implement algorithmic concepts such as recursion and
dynamic programming. You're also expected to assess your solution's runtime and space
efficiency using Big O notation.
After receiving the problem: Don't jump right into coding. It's crucial first to make sure you are
solving the correct problem. Due to language barriers, misplaced assumptions, and subtle nuances
that are easy to miss, misunderstanding the problem is a frequent occurrence. To prevent this, make
sure to repeat the question back to the interviewer so that the two of you are on the same page.
Clarify any assumptions made, like the input format and range, and be sure to ask if the input can be
assumed to be non-null or well formed. As a final test to see if you've understood the problem, work
through an example input and see if you get the expected output. Only after you've done these steps
are you ready to begin solving the problem.
When brainstorming a solution: First, explain at a high level how you could tackle the question. This
usually means discussing the brute-force solution. Then, try to gain an intuition for why this brute-
force solution might be inefficient, and how you could improve upon it. If you're able to land on a
more optimal approach, articulate how and why this new solution is better than the first brute-force
solution provided. Only after you've settled on a solution is it time to begin coding.
When coding the solution: Explain what you are coding. Don't just sit there typing away, leaving your
interviewer in the dark. Because coding interviews often let you pick the language you write code in,
you're expected to be proficient in the programming language you chose. As such, avoid pseudocode
in favor of proper compilable code. While there is time pressure, don't take many shortcuts when
coding. Use clear variable names and follow good code organization principles. Write well-styled
code — for example, following PEP 8 guidelines when coding in Python. While you are allowed to cut
some corners, like assuming a helper method exists, be explicit about it and offer to fix this later on.
After you're done coding: Make sure there are no mistakes or edge cases you didn't handle. Then
write and execute test cases to prove you solved the problem.
At this point, the interviewer should dictate which direction the interview heads. They may ask
about the time and space complexity of your code. Sometimes they may ask you to refactor and
clean the code, especially if you cut some corners while coding the solution. They may also extend
the problem, often with a new constraint. For example, they may ask you not to use recursion and
instead tell you to solve the problem iteratively. Or, they might ask you to not use surplus memory
and instead solve the problem in place. Sometimes, they may pose a tougher variant of the problem
as a follow-up, which might require starting the problem-solving process all over again.
In the context of companies asking interview questions, we care not just about establishing tight
bounds on performance but also about the worst-case scenario for that performance. As such,
Big O notation often describes the "worst-case upper bound," or the longest an algorithm would run
or the maximal amount of space it would need in the worst case.
For instance, consider an array of size N. Here are the following classes of runtime complexities, from
fastest to slowest, using Big O notation:
O(1): Constant time. Example: getting a value at a particular index from an array
O(log N): Logarithmic time. Example: binary search on a sorted array
O(N): Linear time. Example: using a for-loop to traverse through an array
O(N log N): Log-linear time. Example: running mergesort on an array
O(N^2): Quadratic time. Example: iterating over every pair of elements in an array using a
double for-loop
O(2^N): Exponential time. Example: recursively generating all binary numbers that are N
digits long
O(N!): Factorial time. Example: generating all permutations of an array
The same Big-O runtime analysis concepts apply analogously to space complexity. For example, if we
need to store a copy of an input array with N elements, that would be an additional O(N) space. If we
wanted to store an adjacency matrix among N nodes, we would need O(N^2) space to keep the N-
by-N sized matrix.
For a basic example of both runtime and space complexity analysis, we can look at binary search,
where we are searching for a particular value within a sorted array. The code that implements this
algorithm is below (with an extra set of conditions that returns the closest value if the exact value is
not found):
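A sketch of such an implementation in Python (the tie-breaking behavior for "closest" here is one reasonable choice):

def binary_search(arr, target):
    """Return the index of target in sorted list arr, or the index of the closest value."""
    if not arr:
        return -1
    lo, hi = 0, len(arr) - 1
    closest = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        # Track the index whose value is closest to the target so far
        if abs(arr[mid] - target) < abs(arr[closest] - target):
            closest = mid
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return closest

print(binary_search([1, 3, 5, 8, 13], 8))   # 3 (exact match)
print(binary_search([1, 3, 5, 8, 13], 9))   # 3 (closest value is 8)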
If we start the binary search with an input of N elements, then at the next iteration, we would only
need to search through N/2 elements, and so on. The runtime complexity for binary search is O(log
N), since at each iteration we cut the remaining search space in half. The space complexity is O(N) for
the input array of size N; the search itself needs no auxiliary space beyond a few variables.
Data Structures
Below is a brief overview of the most common data structures used for coding interviews. The best
way to become familiar with each data structure is by implementing a basic version of it in your
favorite language. Knowing the Big-O for common operations, like inserting an element or finding an
element within the structure, is also essential. The table below can be used for reference:
Data Structure       Average Case                                  Worst Case                                    Space (Worst)
                     Access     Search     Insertion  Deletion     Access     Search     Insertion  Deletion
Array                O(1)       O(n)       O(n)       O(n)         O(1)       O(n)       O(n)       O(n)         O(n)
Stack                O(n)       O(n)       O(1)       O(1)         O(n)       O(n)       O(1)       O(1)         O(n)
Queue                O(n)       O(n)       O(1)       O(1)         O(n)       O(n)       O(1)       O(1)         O(n)
Linked List          O(n)       O(n)       O(1)       O(1)         O(n)       O(n)       O(1)       O(1)         O(n)
Hash Map             N/A        O(1)       O(1)       O(1)         N/A        O(n)       O(n)       O(n)         O(n)
Binary Search Tree   O(log n)   O(log n)   O(log n)   O(log n)     O(n)       O(n)       O(n)       O(n)         O(n)
Arrays
An array is a series of consecutive elements stored sequentially in memory. Arrays are optimal for
accessing elements at particular indices, with an O(1) access and index time. However, they are
slower for searching and deleting a specific value, with an O(N) runtime, unless sorted. An array's
simplicity makes it one of the most commonly used data structures during coding interviews.
Common array interview questions include:
Moving all the negative elements to one side of an array
Merging two sorted arrays
Finding specific sub-sequences of integers within the array, such as the longest consecutive
subsequence or the consecutive subsequence with the largest sum
A frequent pattern for array interview questions is the existence of a straightforward brute-force
solution that uses O(n) space, and a more clever solution that uses the array itself to lower the space
complexity down to O(1). Another pattern we've seen when dealing with arrays is the prevalence of
off-by-1 errors — it's easy to crash the program by accidentally reading past the last element of an
array.
For jobs where Python knowledge is important, interviews may cover list comprehensions, due to
their expressiveness and ubiquity in codebases. As an example, below, we use a list comprehension
to create a list of the first 10 positive even numbers. Then, we use another list comprehension to
find the cumulative sum of the first list:
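One way to write these two comprehensions:

# First 10 positive even numbers
evens = [2 * i for i in range(1, 11)]
# evens == [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]

# Cumulative sum of the first list via another list comprehension
cumulative = [sum(evens[:i + 1]) for i in range(len(evens))]
# cumulative == [2, 6, 12, 20, 30, 42, 56, 72, 90, 110]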
Arrays are also at the core of linear algebra since vectors are represented as 1-D arrays, and matrices
are represented by 2-D arrays. For example, in machine learning, the feature matrix X can be
represented by a 2-D array, with one dimension as the number of data points (n) and the other as
the number of features (d).
Linked Lists
A linked list is composed of nodes with data that have pointers to other nodes. The first node is
called the head, and the last node is called the tail. Linked lists can be circular, where the tail points
to the head. They can also be doubly linked, where each node has a reference to both the previous
and next nodes. Linked lists are optimal for insertion and deletion, with O(1) insertion time at the
head or tail, but are worse for indexing and searching, with a runtime complexity of O(N) for
indexing and O(N) for search.
To implement a linked list, we first define a Node class that holds a value and a pointer to the next
node. Then we create the LinkedList class, along with a method to reverse its elements. The reverse
function iterates through each node of the linked list and, at each step, swaps the pointers between
the current node and its neighbors, as in the sketch below.
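A minimal sketch of these classes:

class Node:
    def __init__(self, value):
        self.value = value
        self.next = None

class LinkedList:
    def __init__(self):
        self.head = None

    def push(self, value):
        # Insert a new node at the head in O(1) time
        node = Node(value)
        node.next = self.head
        self.head = node

    def reverse(self):
        # Re-point each node at its predecessor, using O(1) extra space
        prev = None
        curr = self.head
        while curr:
            nxt = curr.next
            curr.next = prev
            prev = curr
            curr = nxt
        self.head = prev

lst = LinkedList()
for value in [3, 2, 1]:
    lst.push(value)      # list is now 1 -> 2 -> 3
lst.reverse()            # list is now 3 -> 2 -> 1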
Like array interview questions, linked list problems often have an obvious brute-force solution that
uses O(n) space, but then also a more clever solution that utilizes the existing list nodes to reduce
the memory usage to O(1). Another commonality between array and linked list interview solutions is
the prevalence of off-by-one errors. In the linked list case, it's easy to mishandle pointers for the
head or tail nodes.
Stacks and Queues
The main difference between a stack and a queue is the removal order: in a stack, there is a LIFO
order, whereas in a queue it's a FIFO order. Stacks are generally used in recursive operations,
whereas queues are used in more iterative processes.
Common stacks and queues interview questions include:
Writing a parser to evaluate regular expressions (regex)
Evaluating a math formula using order of operations rules
Running a breadth-first or depth-first search through a graph
An example interview problem that uses a stack is determining whether a string has balanced
parentheses. Balanced, in this case, means every opening bracket is matched by a closing bracket of
the same type, in the correct order. For instance, the string "({}((){}))" is correctly balanced, whereas
the string "{}())" is not balanced, due to the last character, ')'. The algorithm is as follows: push each
opening bracket onto a stack; for each closing bracket, check that it matches the bracket on top of
the stack and pop it; the string is balanced if, at the end, the stack is empty.
Hash Maps
A hash map stores key-value pairs. For every key, a hash map uses a hash function to compute an
index, which locates the bucket where that key's corresponding value is stored. In Python, a
dictionary offers support for key-value pairs and has the same functionality as a hash map.
While a hash function aims to map each key to a unique index, there will sometimes be "collisions"
where different keys have the same index. In general, when you use a good hash function, expect
the elements to be distributed evenly throughout the hash map. Hence, lookups, insertions, or
deletions for a key take constant time.
Due to their optimal runtime properties, hash maps make a frequent appearance in coding interview
questions.
Common hash map questions center around:
Finding the unions or intersection of two lists
Finding the frequency of each word in a piece of text
Finding four elements a, b, c and d in a list such that a + b = c + d
An example interview question that uses a hash map is determining whether an array contains two
elements that sum up to some value k. For instance, say we have the list [3, 1, 4, 2, 6, 9] and k = 11. In
this case, we return true since 2 and 9 sum up to 11.
The brute-force method to solving this problem is to use a double for-loop and sum up every pair of
numbers in the array, which provides an O(N^2) solution. But, by using a hash map, we only have to
iterate through the array with a single for-loop. For each element in the loop, we check whether the
complement of the number (target minus that number) exists in the hash map, achieving an O(N)
solution:
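A sketch of the hash map approach in Python (here the hash map stores each value's index):

def has_pair_with_sum(nums, target):
    seen = {}                        # value -> index, backed by a hash table
    for i, num in enumerate(nums):
        if target - num in seen:     # complement already visited?
            return True
        seen[num] = i
    return False

print(has_pair_with_sum([3, 1, 4, 2, 6, 9], 11))   # True, since 2 + 9 == 11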
Due to a hash function's ability to efficiently index and map data, hashing functions are used in many
real-world applications (in particular, with regards to information retrieval and storage). For example,
say we need to spread data across many databases to allow for data to be stored and queried
efficiently while distributed. Sharding, covered in depth in the databases chapter, is one way to split
the data. Sharding is commonly implemented by taking the given input data, and then applying a
hash function to determine which specific database shard the data should reside on.
Trees
A tree is a basic data structure with a root node and subtrees of child nodes. The most basic type
of tree is a binary tree, where each node has at most two child nodes. Binary trees can be
implemented with a left and right child node, like below:
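For example, a minimal node class in Python:

class TreeNode:
    def __init__(self, value):
        self.value = value
        self.left = None     # left child
        self.right = None    # right child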
There are various types of traversals and basic operations that can be performed on trees. For
example, in an in-order traversal, we first process the left subtree of a node, then process the
current node, and, finally, process the right subtree:
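A recursive sketch of an in-order traversal over the TreeNode class above:

def in_order(node):
    if node is None:
        return
    in_order(node.left)      # 1. process the left subtree
    print(node.value)        # 2. process the current node
    in_order(node.right)     # 3. process the right subtree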
The two other closely related traversals are post-order traversal and pre-order traversal. A simple
way to remember how these three algorithms work is by remembering that the "post/pre/in" refers
to the placement of the processing of the root value. Hence, a post-order traversal processes the left
child node first, then the right child node and, in the end, the root node. A pre-order traversal
processes the root node first, then the left child node, and then, the right child node.
For searching, insertion, and deletion, the worst-case runtime for a binary tree is O(N), where N is
the number of nodes in the tree.
Common tree questions involve writing functions to get various properties of a tree, like the depth
of a tree or the number of leaves in a tree. Oftentimes, tree questions boil down to traversing
through the tree and recursively passing some data in a top-down or a bottom-up manner. Coding
interview problems also often focus on two specific types of trees: Binary Search Trees and Heaps.
Binary Search Trees
A binary search tree (BST) is a binary tree in which each node's left subtree contains only smaller
values and its right subtree contains only larger values. As an example, consider searching for the
value 9 in a BST whose root is 8 and whose right child is 10. To find 9, we first examine the root value,
8. Since 9 is greater than 8, the node containing 9, if it exists, would have to be on the right side of
the tree. Thus, we've cut the search space in half. Next,
we compare against the node 10. Since 9 is less than 10, the node, should it exist, has to be on the
left of 10. Again, we've cut the search space in half. In conclusion, since 10 doesn't have a left child,
we know 9 doesn't occur in the tree. By cutting the search space in half at each iteration, BSTs
support search, insertion, and deletion in O(log N) runtime.
Because of their lookup efficiency, BSTs show up frequently not just in coding interviews but in real-
life applications. For instance, B-trees, which are used universally in database indexing, are a
generalized version of BSTs. That is, each node is allowed more than two children (up to M), but the
searching and insertion process is similar to that of a BST. These properties allow B-trees to have
O(log N) lookup and insertion runtimes similar to those of BSTs, where N is the total number of nodes in the B-
tree. Because of the logarithmic growth of the tree depth, database indexes with millions of records
often only have a B-tree depth of four or five layers.
[Figure: Example of a B-tree]
Common BST questions cover:
Testing if a binary tree has the BST property
Finding the k-th largest element in a BST
Finding the lowest common ancestor between two nodes (the closest common node to two
input nodes such that both input nodes are descendants of that node)
An example implementation of a BST using the TreeNode class, with an insert function, is as follows:
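A sketch of such an insert function, building the example tree rooted at 8 from earlier:

def insert(root, value):
    # Insert value into the BST rooted at root and return the (possibly new) root
    if root is None:
        return TreeNode(value)
    if value < root.value:
        root.left = insert(root.left, value)
    else:
        root.right = insert(root.right, value)
    return root

# Build a small BST with root 8 and right child 10, as in the search example above
root = None
for value in [8, 3, 10, 1, 6, 14]:
    root = insert(root, value)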
Heaps
Another common tree data structure is a heap. A max-heap is a type of heap where each parent
node is greater than or equal to any child node. As such, the largest value in a max-heap is the root
value of the tree, which can be looked up in O(1) time. Similarly, for a min-heap, each parent node is
smaller than or equal to any child node, and the smallest value lies at the root of the tree and can be
accessed in constant time.
[Figures: example minimum heap and maximum heap]
To maintain the heap property, there is a sequence of operations known as "heapify", whereby
values are "bubbled up/down" within the tree based on what value is being inserted or deleted. For
example, say we are inserting a new value into a min-heap. This value starts at the bottom of the
heap and then is swapped with its parent node ("bubbled up") until it is no longer smaller than its
parent (in the case of a min-heap). The runtime of this heapify operation is the height of the tree,
O(log N).
In terms of runtime, inserting or deleting is O(log N), because the heapify operation runs to maintain
the heap property. The search runtime is O(N) since every node may need to be checked in the
worst-case scenario. As mentioned earlier, heaps are optimal for accessing the min or max value
because they are at the root, i.e., O(1) lookup time. Thus, consider using heaps when you care
mostly about finding the min or max value and don't need fast lookups or deletes of arbitrary
elements. Commonly asked heap interview questions include:
finding the K largest or smallest elements within an array
finding the current median value in a stream of numbers
sorting an almost-sorted array (where elements are just a few places off from their correct spot)
To demonstrate the use of heaps, below we find the k-largest elements in a list, using the heapq
package in Python:
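A sketch using heapq's built-in helper (heapq.nlargest maintains a heap of size k internally):

import heapq

def k_largest(nums, k):
    # Roughly O(N log k): keeps a size-k heap while scanning the list
    return heapq.nlargest(k, nums)

print(k_largest([5, 1, 9, 3, 7, 2], 3))   # [9, 7, 5]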