General Material
This checklist is a curated list of topics found in a majority of past interviews. Interviews
typically cover a wide range of these topics and then go in depth on those that are most
relevant to the specific role. The topics are grouped by ROI for typical ML/AI engineer and
research roles.
Understanding the material is only a small portion of the interview. Effectively communicating
your knowledge and how you would contribute to the business goals is a critical portion
of the interview. Previous fellows have found that practice interviews, algorithm questions, and
code/data challenges are the best way to prepare for real interviews.
Highest ROI
1. CS Fundamentals
2. Cultural Questions and Effective Communication
High ROI
1. Stats and Probability
2. Business and ML Case Studies
3. Randomization
4. Machine Learning
Lower ROI
1. Deep learning basics
2. SQL
3. Tools
CS Fundamentals
Method:
● Breadth first (make sure you have a high level understanding of all topics before diving
too deep in one)
● Solve questions systematically !!!
Process:
● Go through every relevant chapter in Cracking the Coding Interview (6th ed) and:
● Read the introduction to know all the important concepts
● Try to solve the first exercise (only using a whiteboard!)
● If you struggle, or have looked at all the other chapters, keep doing more
exercises
Must Know:
Big-O notation ("Time/Space Complexity"):
● Know the ones for common sorts and data structure operations
● Be able to give the time and space complexity of your solution to problems
Data structures
● Hash table (on coursera): hash functions
● Stack / Queue / Deque
● Linked list (O(n) vs O(1) implementations)
● Heap: constructs in O(n) time
● Trees: balanced (height stays O(log n), e.g. subtree heights differ by at most one) vs. unbalanced
● Binary search trees (BST)
● Bloom filters
● Understand time and space complexities of typical operations (lookup, insert, delete,
update) as well as implementation details in relational database
● ML Related data structures
○ How to represent things related to deep learning (vectors,matrices)
○ Various issues that come up with this (sparsity, etc)
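As a small illustration of the heap bullet above, here is a minimal sketch using Python's standard-library heapq module (the choice of library is an assumption; the checklist does not name one). heapify builds a min-heap in place in O(n), and each pop costs O(log n). The helper name k_smallest is hypothetical.

```python
import heapq


def k_smallest(values, k):
    """Return the k smallest items in ascending order using a min-heap."""
    heap = list(values)   # copy so the caller's list is left untouched
    heapq.heapify(heap)   # bottom-up heap construction, O(n)
    # Each heappop is O(log n), so extracting k items costs O(k log n).
    return [heapq.heappop(heap) for _ in range(min(k, len(heap)))]


print(k_smallest([7, 2, 9, 4, 1, 8], 3))
```

Being able to say why this is O(n + k log n) rather than the O(n log n) of a full sort is the kind of complexity reasoning interviewers tend to ask for.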
Algorithms
● Sorting algorithms (don’t spend too much time) good to know
○ Bubble sort O(n^2), constant memory
○ Selection sort O(n^2), constant memory
○ Insertion sort O(n^2), constant memory
○ Merge sort O(n log n), O(n) extra memory
○ Quick sort O(n log n) on average, O(n^2) worst case; in-place, O(log n) stack space on average
○ Heap sort O(n log n), constant memory
○ Timsort (hybrid of merge and insertion) (only need to know that this is what
Python does)
○ Bucket sort
● Graph algorithms
○ depth first search (DFS) & breadth first search (BFS) + recursion
○ Directed Acyclic Graph good to know
○ Dijkstra’s algorithm (shortest path) good to know
○ path of least resistance good to know
○ Traveling salesman - graph traversal
● Dynamic programming: memoization, knapsack problem
● Recursion
○ Fibonacci
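The recursion and memoization bullets above can be illustrated with Fibonacci. This is a sketch, not a prescribed solution: the function names are hypothetical, and functools.lru_cache is just one convenient way to memoize in Python.

```python
from functools import lru_cache


def fib_naive(n):
    """Exponential time: the same subproblems are recomputed many times."""
    return n if n < 2 else fib_naive(n - 1) + fib_naive(n - 2)


@lru_cache(maxsize=None)
def fib_memo(n):
    """Top-down DP: each n is computed once, so O(n) time and O(n) cache."""
    return n if n < 2 else fib_memo(n - 1) + fib_memo(n - 2)


def fib_iter(n):
    """Bottom-up DP: O(n) time, O(1) extra space, no recursion depth limit."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


print(fib_naive(10), fib_memo(50), fib_iter(50))
```

Starting from the naive version and then improving it matches the "start simple" advice in the Tips section.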
Good to know conceptually
● Concurrency - how to avoid race conditions, deadlocks, livelocks
● Threading vs Processes (high level explanation)
● Modulo and Hashing (MD5, SHA-1)
Example problems:
● Write a function that lists all three-number lock combinations, where each number is
between 0 and 30 with no repeats. Then, write a function which generalizes to more than
three numbers in the combination. Then, write a function which generalizes to longer
combinations, and doesn’t store all the combinations in a list (if you hadn’t already in the
second part).
● Traverse a binary search tree in order without using recursion
● Write a function which takes a base-10 number and a positive integer k >= 2 and prints
out the number in base k.
● Compute the median of a set of numbers which is too large to fit into memory (Hint: use
a histogram)
● Solve the 0-1 (as opposed to unbounded) knapsack problem
● Given an array of numbers, find the contiguous subarray which has the greatest sum, in linear time.
● Text processing exercises:
○ reverse a string in-place (without consuming any extra memory)
○ remove stop words from a sentence
○ find the palindromes (substrings which read the same forward and backward) in
a given string
○ word count (not necessarily in MapReduce)
● Other resources for questions:
○ http://katemats.com/interview-questions/
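Three of the example problems above have short, well-known solutions. The sketches below show one possible answer to each (base-k conversion, the maximum-sum subarray via Kadane's algorithm, and in-order BST traversal with an explicit stack). All names, including the Node class, are hypothetical, and the base-k helper returns a string rather than printing it.

```python
def to_base_k(n, k):
    """Return the base-k representation of a non-negative integer (2 <= k <= 36)."""
    if not 2 <= k <= 36:
        raise ValueError("k must be between 2 and 36")
    if n < 0:
        raise ValueError("n must be non-negative")
    digits = "0123456789abcdefghijklmnopqrstuvwxyz"
    if n == 0:
        return "0"
    out = []
    while n > 0:
        n, r = divmod(n, k)          # peel off the least significant digit
        out.append(digits[r])
    return "".join(reversed(out))


def max_subarray(nums):
    """Kadane's algorithm: (best_sum, start, end) with end exclusive, in O(n)."""
    if not nums:
        raise ValueError("nums must be non-empty")
    best_sum, best_start, best_end = nums[0], 0, 1
    cur_sum, cur_start = nums[0], 0
    for i in range(1, len(nums)):
        if cur_sum < 0:              # a negative prefix never helps: restart here
            cur_sum, cur_start = nums[i], i
        else:
            cur_sum += nums[i]
        if cur_sum > best_sum:
            best_sum, best_start, best_end = cur_sum, cur_start, i + 1
    return best_sum, best_start, best_end


class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right


def inorder_iterative(root):
    """Yield BST keys in sorted order using an explicit stack, no recursion."""
    stack, node = [], root
    while stack or node:
        while node:                  # walk as far left as possible
            stack.append(node)
            node = node.left
        node = stack.pop()
        yield node.key
        node = node.right


tree = Node(4, Node(2, Node(1), Node(3)), Node(6))
print(to_base_k(255, 2), max_subarray([-2, 1, -3, 4, -1, 2, 1, -5, 4]), list(inorder_iterative(tree)))
```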
Tips:
● Always start simple with a brute-force solution, even if it is O(n^2). It shows that you
understand the problem and buys you some time.
● Keep talking through your thought process while you're doing the problem.
● Instead of just giving the regular answer from the interview book, talk about how you
would approach this area followed by the answer.
○ Any question in recursion: I don't normally use recursion because it may not
scale well, but here is how I would approach this problem <followed by the
regular answer>
○ Any question on multiple thread locks: I would probably restructure the code to
be simpler to avoid multiple locks. If that's not possible, do this <regular answer>
● The structure of the code is very important - shows how you break down a problem
○ Modular, different functions for different pieces, able to swap out pieces
○ This is often more important than the exact code. Could have a syntax error but
high quality breakdown of the problem and algorithmic thinking process
Resources:
● MIT OCW videos
● Problem Solving with Algorithms and Data Structures in Python
● Cracking the Coding Interview (Github)
● Coursera Algorithms course
● Programming Interviews Exposed (in Drive)
● Leetcode or HackerRank for practice interview questions
● Khan Academy Algorithms
● Stanford Coursera Algos
Business and ML Case Studies
These questions make up a large portion of interviews. They are very open ended and test
both your product sense and how you work as an applied AI practitioner on a high level. They
usually do not have a single right answer, and rely heavily on communicating with your
interviewer. Because of their open ended nature, the listed questions are not meant to be
comprehensive, but rather a starting point. A company deep dive is absolutely necessary
to perform well on these questions.
Solve these questions top-down (start with business needs, and ways to evaluate
success, and ways to get a simple benchmark BEFORE talking about algorithms)
● How can you analyze the data?
○ Simple first pass
■ Visual trends / outliers
■ Counts / Group By / Histograms
○ More advanced
■ Regression relationships
■ Unsupervised Clustering
● What conclusions/relationships can you find from a graph?
● What actions/decisions are made as a result?
● How can you measure success/impact of those actions/decisions?
Serve and Scale a Machine Learning Algorithm
Read Martin Zinkevich’s ML Eng best practices, another set of viewpoints
Design an Experiment
● What is the hypothesis to be tested?
● What are the controls?
● Is the experiment worth running in the first place?
● How exactly will the experiment be run?
○ What effect size is needed for us to make a decision?
○ How many samples are needed to see the effect size as significant?
○ Would you peek at the data? In which cases?
■ How does this affect the experiment?
○ When/how would you stop?
● What would the results look like?
● What biases do you need to worry about?
○ How do you correct for them?
○ What caveats are in place for those that you can’t correct for?
● What are the action items for each possible result of the experiment?
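For the sample-size question above, a rough power calculation can be sketched as follows. It assumes a two-sided two-proportion z-test with equal group sizes and uses the usual normal approximation; the function name and the example conversion rates are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def samples_per_group(p_control, p_treatment, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided two-proportion z-test."""
    if p_control == p_treatment:
        raise ValueError("effect size must be non-zero")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value of the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Hypothetical example: detect a lift from a 10% to a 12% conversion rate.
print(samples_per_group(0.10, 0.12))
```

The required sample size grows with the inverse square of the effect size, which is why agreeing on the minimum effect worth detecting comes before running the experiment.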
Lingo/Topics to know
● Analytics
○ SEO
○ SEM
○ Social Marketing
○ Viral Coefficient
● Churn
● Customer lifetime value (LTV)
● Funnel analysis
● Pain Points
● Cohort analysis
● Correlation does not imply causation!
● A/B testing - Udacity’s course
○ A/A testing
○ Power analysis
Machine Learning
Content:
Classification/Regression
● Linear Regression, Logistic Regression
○ Regularization, number of features
○ Know the derivation/algorithm of both
● Decision trees - how do they work, how would you scale/parallelize/serve
○ Random Forests
○ Gradient Boosted Trees (and why XGBoost is good)
● Random Forests, Gradient Boosted Trees (decision trees)
● SVM
○ Be able to describe the cost function, what it is optimizing, and kernel trick
○ Soft margin vs Hard margin
○ MIT Course on SVM
● Naive Bayes
● kNN
● Neural networks
○ How do you know a NN has converged?
○ Describe regularization methods
○ What is batch norm, how would you implement it?
● Clustering
○ K-means: be able to code it up from scratch; know about initialization and
convergence
○ Gaussian Mixture Models
● Dimensionality reduction
○ SVD, PCA
○ t-SNE
○ Manifold learning
○ Embeddings (word2vec, GloVe, as components of NN models)
● Optimization schemes:
○ Batch: Gradient Descent, Conjugate Gradient, BFGS, etc.
○ Online: SGD, RMSProp, Adam
● Model selection (in particular k-fold validation)
● Combining models - Ensembles, Boosting
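Since the clustering bullet above asks you to be able to code up K-means, here is a minimal pure-Python sketch of Lloyd's algorithm. It assumes points are equal-length tuples, initializes centroids from randomly chosen data points (one of several possible initialization schemes), and treats "no assignment changed" as convergence. The function name and parameters are hypothetical.

```python
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd's algorithm on a list of equal-length tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)   # initialization: k distinct data points
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        if new_labels == labels:        # converged: no assignment changed
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:                 # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels


points = [(0.0, 5.0), (1.0, 5.0), (2.0, 5.0), (10.0, 5.0), (11.0, 5.0), (12.0, 5.0)]
centroids, labels = kmeans(points, 2)
print(sorted(centroids), labels)
```

K-means only finds a local optimum, so in practice it is common to rerun it from several initializations (or use a scheme such as k-means++) and keep the best result.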
Recommendation systems
● Ethan’s blog posts are a good overview (see python-recsys if you want a library):
○ 1
○ 2
● Different recommendation systems, including an example of youtube
Tips:
● Know Python scikit-learn modules and processes
● Know the pros and cons of different algorithms
● Know how the algorithms work
● Know the ins and outs of all the ML algorithms you list on your resume
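One of the neural network questions in the Content list asks how you would implement batch norm. A minimal sketch of the training-mode forward pass is below, in pure Python for readability. It assumes the batch is a list of feature rows and that gamma and beta are per-feature lists; the function name is hypothetical. At inference time, frameworks typically substitute running averages for the batch statistics, which this sketch does not show.

```python
from math import sqrt


def batch_norm_forward(batch, gamma, beta, eps=1e-5):
    """Training-mode batch norm for a batch given as a list of feature rows."""
    n = len(batch)
    columns = list(zip(*batch))                      # one tuple per feature
    means = [sum(col) / n for col in columns]
    variances = [
        sum((x - m) ** 2 for x in col) / n           # biased (population) variance
        for col, m in zip(columns, means)
    ]
    # Normalize each feature, then apply the learned scale (gamma) and shift (beta).
    return [
        [g * (x - m) / sqrt(v + eps) + b
         for x, m, v, g, b in zip(row, means, variances, gamma, beta)]
        for row in batch
    ]


out = batch_norm_forward([[1.0, 10.0], [3.0, 30.0]], gamma=[1.0, 1.0], beta=[0.0, 0.0])
print(out)
```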
Resources:
● Stanford CS 229 Machine Learning (taught by Andrew Ng)
● CMU/10-701
● For review see shape of data blog and LearnDataScience on github
● Read Martin Zinkevich’s ML Eng best practices
Cultural Questions and Effective Communication
Every interaction (email/phone/onsite) is a chance for both the interviewers and you to evaluate
whether you all want to spend 40+ hours each week together. Once you pass the technical bar,
cultural fit is the last thing that can determine whether or not you receive the offer. These
questions range from being focused on you to being focused on the company.
Use a systematic approach and prepare your answers ahead of time!! See this sheet.
Resources:
● Here’s a list of common cultural/behavioral interview questions -- be prepared to answer
at least a few in each category
Content:
More about you
● Tell me about your background - elevator pitch
○ Prepare to answer questions for any part of your demo and resume
● Tell me about a time when… (STAR - Situation, Task, Action, Result)
○ Didn’t get along with a coworker
○ Overcame a problem
○ Here is a list of 75 Behavioral Interview Questions (in Drive)
● Why AI? Why ML engineering?
● What roles are you interested in?
● Why do you want to join our company?/What excites you about this position?
● What are you going to bring to the company? (Talk about this even if not asked!)
○ Read first couple of chapters of Sell Yourself in any Interview, by Oscar Adler (on
bookshelf)!
● Come prepared with questions about the company that you care about!
○ MAKE A LIST BEFOREHAND
○ Having no questions implies a lack of interest
○ Lookup roles of interviewers beforehand
● Check out the interview experience of others on GeeksforGeeks or Glassdoor.
● Data storytelling
○ Describe challenges, roadblocks you encountered and how you overcame them
by describing algorithms, data cleaning, tools, pipeline, etc. you employed.
● Coding in real time while “speaking out” your thought process
○ on a board
○ pair programming
○ online & over phone - Coderpad.io
Tips:
● Know your projects inside out. You should know how your model(s) work and be able to
speak about them confidently. Saying "maybe this is what it does... not really sure..."
comes across very badly.
● If there is something on your resume (especially specific algorithm names), then make
sure you know it and can confidently talk about it.
● Look up background of people interviewing and ask questions tailored to their
background, if you can.
● A day before an on-site interview, do a product deep dive and specifically take some
time to think of products (features) you would build if you could build anything for that
company.
○ Make sure to think about internal tools and pipelines to enable product, features,
and analytics of those products and features.
○ Talk it out with other Fellows.
○ How would you improve their site/product?
○ Why would you build that?
○ What would be the challenges in building it?
○ Get creative
Deep learning basics
As a good starting point, we recommend Goodfellow, Bengio, and Courville’s Deep
Learning book. It’s quite lengthy, so it’s best to concentrate on Chapters 6-10 but feel
free to explore the introductory or more in-depth material.
There are also a ton of high quality videos on the web. Neural Networks demystified for
a more basic and gentle introduction, Stanford’s CS231n with Andrej Karpathy (though
almost solely image based), and Fast.AI’s Practical Deep Learning for Coders.
Additionally, Hugo Larochelle’s talk at the recent Bay Area Deep Learning School gives a
great walk through and review of all of the components of a deep feed-forward neural
net (other good talks in the video too, though some get a little too detailed). Note that Ng’s
talk (below) is at the end of this video.
The Nuts and Bolts of Applying Deep Learning (Andrew Ng) gives a great overview of
how to practically build, train, and implement AI systems.
Finally, there are a ton of good blogs out there, including Andrej Karpathy’s and Chris
Olah’s.
Randomization
● Biases in your dataset (how to check, how might they creep in?)
● How to randomize things
● Limitations
● Augmentations
● Repeatable randomization
○ Hashing integers
● Dealing with imbalanced datasets
○ Two great posts here and here.
● Streaming data
○ How do you deal with that?
○ Running averages
○ Running stds
○ You may not need to memorize the formulas, but know how to compute statistics on streaming data
○ KL-divergence
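For the running averages and running stds bullets above, one standard approach is Welford's online algorithm, which updates the mean and variance one value at a time with O(1) memory. The class name below is hypothetical, and the sketch reports the sample standard deviation (n - 1 denominator), which is an assumption about which variant you want.

```python
from math import sqrt


class RunningStats:
    """Welford's online algorithm: mean and std in one pass with O(1) memory."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0   # running sum of squared deviations from the current mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)   # uses the mean before and after the update

    @property
    def std(self):
        """Sample standard deviation; 0.0 until two values have been seen."""
        return sqrt(self._m2 / (self.n - 1)) if self.n > 1 else 0.0


stats = RunningStats()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.update(value)
print(stats.mean, stats.std)
```

Compared with accumulating the sum and the sum of squares, this form avoids subtracting two large, nearly equal numbers, so it is more numerically stable on long streams.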
SQL
Most teams won’t test this in interviews, but since most companies still use
SQL as their main data store, it’s helpful to understand how it fits into the modeling
ecosystem. I would spend at most 1-2 days on this.
Content:
● Queries
○ Joins: inner, left (outer), right (outer), full outer
■ esp. see Venn diagram visualization
○ Select
■ nested selects
■ create as select (and views)
○ As
○ Where
○ Order By
○ Limit
○ Aggregate functions (min/max, avg (mean), count, first/last)
■ Having
■ Group By
○ Insert/delete
○ union
○ window
○ If/Then
■ Case
■ Coalesce
■ IF in SELECT statement:
● http://www.java2s.com/Code/SQL/Flow-Control/UseIFinselectclause.htm
● Pros and cons of SQLite, MySQL, PostgreSQL, Oracle, NoSQL, etc.
● Basic principles of MapReduce/Hadoop/Hive, shard, RethinkDB
● Style
○ indentation is important. Write on whiteboard with proper indentation and you’ll
show you write clean SQL.
○ Write keywords in uppercase (SELECT, FROM, WHERE). Table names and variables, lowercase.
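To practice several of the query topics above together (joins, aggregates, Group By, Having, Coalesce) along with the style advice, one option is an in-memory SQLite database driven from Python's standard-library sqlite3 module. The tables and data below are hypothetical, and SQLite is chosen only because it needs no setup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL);
    INSERT INTO users  VALUES (1, 'ana'), (2, 'bo'), (3, 'cy');
    INSERT INTO orders VALUES (1, 1, 20.0), (2, 1, 35.0), (3, 2, 5.0);
""")

# LEFT JOIN keeps users with no orders; COALESCE turns their NULL sum into 0.
totals = conn.execute("""
    SELECT u.name,
           COUNT(o.id)                AS n_orders,
           COALESCE(SUM(o.amount), 0) AS total_amount
    FROM users AS u
    LEFT JOIN orders AS o
           ON o.user_id = u.id
    GROUP BY u.id, u.name
    ORDER BY total_amount DESC
""").fetchall()

# HAVING filters on the aggregate after GROUP BY; WHERE would filter rows before it.
big_spenders = conn.execute("""
    SELECT u.name
    FROM users AS u
    INNER JOIN orders AS o
            ON o.user_id = u.id
    GROUP BY u.id, u.name
    HAVING SUM(o.amount) > 10
""").fetchall()

print(totals)
print(big_spenders)
```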
Resources:
● One of the best ways to learn and practice is the Mode SQL school. You can also try
SQLZoo.
● See 10 steps to a complete understanding of SQL and the brief command listing at
w3schools for syntax.
● If you have time, take the Coursera db class and read up on relational algebra. If not,
try the exercises.
● Some of the more complicated stuff with SQL that has shown up in interviews.
Stats and Probability
Hiring managers have specifically stated that they expect Insight Fellows to know graduate level
statistics very well; give this section a high priority.
Content:
More likely
● Probability and Bayes theorem
○ p values
○ Monty Hall problem
○ What’s the probability you have a disease if you test positive for it?
○ Bayes and Bayesian concepts applied to ML are growing in popularity
● Modeling regressions
○ multiple, logistic
○ Feature selection
● A/B Testing
● Multiple Comparisons Correction
○ Bonferroni correction
● Use cases of different statistics (mean, median, standard deviation, variance, standard
error, etc).
○ Confidence intervals
○ Type 1 and type 2 error: OIStats pg 176: “A Type 1 Error is rejecting the null
hypothesis when H0 is actually true. A Type 2 Error is failing to reject the null
hypothesis when the alternative is actually true.”
○ When would you want to use the mean instead of the median? When we need to
Less likely
● maximum likelihood estimator
● Sampling distribution
● Resampling
○ bootstrap (random sampling), bootstrap for confidence interval on median
○ jackknife (leave-one-out subsets)
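The "bootstrap for confidence interval on median" item above can be sketched with the percentile bootstrap: resample the data with replacement many times, take the median of each resample, and read the interval off the sorted medians. The function name, the number of resamples, and the example data are all hypothetical choices.

```python
import random
from statistics import median


def bootstrap_median_ci(data, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the median."""
    rng = random.Random(seed)
    n = len(data)
    medians = sorted(
        median(rng.choices(data, k=n))   # resample n points with replacement
        for _ in range(n_resamples)
    )
    tail = (1 - confidence) / 2
    lower = medians[int(tail * n_resamples)]
    upper = medians[min(n_resamples - 1, int((1 - tail) * n_resamples))]
    return lower, upper


data = list(range(1, 102))   # hypothetical sample whose median is 51
lower, upper = bootstrap_median_ci(data)
print(lower, upper)
```

The same pattern works for any statistic without a convenient closed-form standard error: swap median for the statistic of interest.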
Unlikely
● T-test
○ one tailed vs. two tailed tests
○ ANOVA (requirements: independence, approximately normal, constant variance)
■ simple use cases of ANOVA
● Hypothesis testing (complicated)
● Chi-square: goodness-of-fit test (e.g. testing for normality)
○ When is it appropriate to assume a normal distribution, and when is it not?
● R^2 (Pearson’s correlation coefficient, squared)
● QQ plot
● “descriptive stats”
● Kolmogorov-Smirnov tests (non parametric)
● Power analysis
○ http://www.ats.ucla.edu/stat/seminars/Intro_power/
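The disease-test question in the "More likely" list is a direct application of Bayes' theorem. A minimal sketch follows; the function name and the example prevalence, sensitivity, and specificity are hypothetical numbers chosen for illustration.

```python
def posterior_given_positive(prevalence, sensitivity, specificity):
    """P(disease | positive test) via Bayes' theorem."""
    true_pos = sensitivity * prevalence                # P(positive and disease)
    false_pos = (1 - specificity) * (1 - prevalence)   # P(positive and no disease)
    return true_pos / (true_pos + false_pos)


# Hypothetical numbers: 1% prevalence, 95% sensitivity, 95% specificity.
posterior = posterior_given_positive(0.01, 0.95, 0.95)
print(round(posterior, 3))
```

With these numbers the posterior is only about 16%, because false positives from the much larger healthy population outnumber the true positives. Explaining that intuition is usually what the interviewer is looking for.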
Example questions:
● How would you do variable selection? What kind of model with x data? Can you explain
what hypothesis testing means? What’s a p value?
● How do you know if the data you have is appropriate to model in a regression? How do
you test it? How do you know if your data is normally distributed? What do you do if it’s
not? How do you test if your data is distributed the same as another system?
Resources:
● openintro stats
● statistics done wrong
● Cartoon Guide to Statistics
● Khan Academy Probability
Tools
● Version control -- git/github:
○ commands: init, commit, push, branch, clone, pull, merge, status, diff, log, stash
○ basics of .gitconfig and .gitignore
○ Go through this quick git tutorial, and test what you have learned by going
through this Github Developmental setup
● Deep Learning Frameworks - pros and cons
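○ Understanding autodiff - both TensorFlow and Torch/Chainer compute gradients with
reverse-mode autodiff (backpropagation). The difference is when the graph is built: classic
TensorFlow defines a static graph before running it, while Torch/Chainer define the
computational graph on the fly.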
○ Interesting post here.
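To make the autodiff bullet concrete, here is a toy sketch of reverse-mode automatic differentiation on scalars, with the computational graph recorded on the fly as operations execute. The Var class is hypothetical and supports only addition and multiplication; real frameworks are far more elaborate.

```python
class Var:
    """Scalar node in a computation graph that is built as operations run."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # tuples of (parent Var, local derivative)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value, ((self, other.value), (other, self.value)))

    def backward(self):
        """Reverse sweep: propagate d(output)/d(node) from the output to the inputs."""
        order, seen = [], set()

        def visit(node):                 # topological order via depth-first search
            if id(node) not in seen:
                seen.add(id(node))
                for parent, _ in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):     # output first, inputs last
            for parent, local in node.parents:
                parent.grad += local * node.grad   # chain rule, summed over paths


x, y = Var(3.0), Var(4.0)
z = x * y + x          # the graph is recorded while this line executes
z.backward()
print(x.grad, y.grad)  # dz/dx = y + 1, dz/dy = x
```

One backward sweep yields the gradient with respect to every input, which is why reverse mode suits neural networks with many parameters and a single scalar loss.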