Interview Query FANG Question2
com/questions/missing-housing-data
Question:
We want to build a model to predict housing prices in the city of Seattle. We've scraped 100K sold listings over
the past three years but found that around 20% of the listings are missing square footage data.
Solution:
This is a pretty classic modeling interview question. Data cleanliness is a well-known issue in most datasets
used for building models. Real-life data is messy, incomplete, and almost always needs to be wrangled.
The key to answering this interview question is to probe and ask questions to learn more about the specific
context. For example, we should clarify if there are any other features missing data in the listings.
If square footage is the only column with missing data, we can train models on progressively larger subsets
of the 80% of listings that have complete data and see what the learning curve looks like. If the housing
model trained on 60% of the dataset is only slightly less accurate than the one trained on the full 80% that
has square footage, then, depending on our accuracy tolerances, we may be able to simply drop all of the
listings with missing data. We might also have a larger feature selection problem, given that a roughly
one-third increase in data (from 60% to 80%) does not improve our model accuracy by much.
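The learning-curve check described above can be sketched as follows. This is only an illustration under stated assumptions: the listings are synthetic (the `make_listings` helper and its price formula are invented for the example), a one-feature least-squares line stands in for the real pricing model, and the training fractions are relative to the complete listings rather than the full scrape.

```python
import random

def make_listings(n, seed=0):
    """Synthetic (sqft, price) pairs; the price formula is purely illustrative."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        sqft = rng.uniform(400, 3500)
        price = 50_000 + 300 * sqft + rng.gauss(0, 40_000)
        rows.append((sqft, price))
    return rows

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def rmse(model, xs, ys):
    """Root mean squared error of a fitted (slope, intercept) pair."""
    slope, intercept = model
    squared = sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys))
    return (squared / len(xs)) ** 0.5

def learning_curve(rows, fractions, holdout=0.25, seed=0):
    """Return [(fraction_of_training_pool, holdout_rmse), ...]."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * holdout)
    test, pool = shuffled[:n_test], shuffled[n_test:]
    test_x, test_y = zip(*test)
    curve = []
    for frac in fractions:
        # Train on the first `frac` of the pool, always score on the same holdout.
        train = pool[: max(2, int(len(pool) * frac))]
        train_x, train_y = zip(*train)
        curve.append((frac, rmse(fit_line(train_x, train_y), test_x, test_y)))
    return curve

curve = learning_curve(make_listings(2000), fractions=[0.25, 0.5, 0.75, 1.0])
for frac, err in curve:
    print(f"{frac:.0%} of complete listings -> holdout RMSE {err:,.0f}")
```

If the error barely moves between the last two fractions, the curve has flattened, which is the situation where dropping the incomplete listings costs little.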
The second most common method is imputation. Imputation can be done with different methods
and algorithms, but at its core it is the process of filling in missing data with estimates. If we have found
that we can't build a good enough model by excluding the missing data, we can try different imputation
techniques and cross-validate our models against each technique to figure out which ones work best.
The simplest imputation method for a continuous variable such as square footage would be to insert the mean or
median of the distribution for all of the missing values. The downsides to this approach are that it doesn't
factor in correlation between features and doesn't account for uncertainty. It would
be ridiculous if a studio condo had the same square footage as a five-bedroom home.
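As a minimal sketch of mean or median imputation, assuming missing square footage is represented as `None` (the `impute_constant` helper and the sample values are made up for illustration):

```python
from statistics import mean, median

def impute_constant(values, strategy="median"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed) if strategy == "median" else mean(observed)
    return [fill if v is None else v for v in values]

# Hypothetical square footage column with two missing entries.
sqft = [450, None, 900, 1200, None, 3950]
median_filled = impute_constant(sqft, "median")
mean_filled = impute_constant(sqft, "mean")
print(median_filled)
print(mean_filled)
```

Every missing listing receives the same number regardless of its other features, which is exactly the weakness noted above. The median is also less sensitive than the mean to a handful of very large homes.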
To solve that problem, a second, more advanced method would be to use a simple nearest-neighbors-style
approach that approximates square footage by grouping on different categorical features. What if we
could derive means from different subsets of other features within the housing dataset? If we took the
average square footage of listings grouped by the number of bedrooms, we could impute one average
square footage for a studio and a different one for a five-bedroom home.
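The group-by-bedrooms idea can be sketched like this. The field names (`bedrooms`, `sqft`), the `impute_by_group` helper, and the sample listings are assumptions made for the example, with `0` bedrooms standing in for a studio:

```python
from collections import defaultdict
from statistics import mean

def impute_by_group(listings, key="bedrooms", target="sqft"):
    """Fill missing `target` values with the mean of listings sharing the same `key`.

    A listing whose group has no observed values is left as None.
    """
    groups = defaultdict(list)
    for row in listings:
        if row[target] is not None:
            groups[row[key]].append(row[target])
    group_means = {k: mean(v) for k, v in groups.items()}
    return [
        {**row, target: group_means.get(row[key])} if row[target] is None else dict(row)
        for row in listings
    ]

# Hypothetical listings: 0 bedrooms stands for a studio.
listings = [
    {"bedrooms": 0, "sqft": 400},
    {"bedrooms": 0, "sqft": 500},
    {"bedrooms": 0, "sqft": None},
    {"bedrooms": 5, "sqft": 3000},
    {"bedrooms": 5, "sqft": 3400},
    {"bedrooms": 5, "sqft": None},
]
filled = impute_by_group(listings)
for row in filled:
    print(row)
```

The studio and the five-bedroom home now receive different estimates, each taken from listings of the same size class.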
https://app.interviewquery.com/questions/missing-housing-data
Taking it a step further, we can refine this nearest-neighbor model by introducing multiple
categorical features, with the number of features depending on the size of the dataset. If we took the
average square footage based on existing values for each combination of number of bedrooms, number of
bathrooms, and neighborhood location, we would get an even better approximation of what the square footage
would look like. An example scenario would be a two-bedroom, one-bath condo in Capitol Hill averaging 750
square feet versus a four-bedroom, four-bath house in Magnolia averaging 2,000 square feet.
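A hedged sketch of the multi-feature version follows. Because finer groupings leave fewer listings per group, this example falls back from the most specific grouping to coarser ones and finally to the overall mean. That fallback order, the `min_count` threshold, the field names, and the sample rows are all assumptions for illustration, chosen so the averages match the Capitol Hill and Magnolia scenario above.

```python
from collections import defaultdict
from statistics import mean

def impute_with_fallback(listings, key_sets, target="sqft", min_count=2):
    """Fill missing `target` values from the most specific group that has data.

    `key_sets` lists tuples of field names, most specific first. The first
    group with at least `min_count` observed values supplies the mean;
    otherwise the overall mean of observed values is used.
    """
    observed = [r for r in listings if r[target] is not None]
    overall = mean(r[target] for r in observed)
    tables = []
    for keys in key_sets:
        groups = defaultdict(list)
        for r in observed:
            groups[tuple(r[k] for k in keys)].append(r[target])
        tables.append((keys, groups))
    result = []
    for r in listings:
        filled = dict(r)
        if r[target] is None:
            filled[target] = overall
            for keys, groups in tables:
                values = groups.get(tuple(r[k] for k in keys), [])
                if len(values) >= min_count:
                    filled[target] = mean(values)
                    break
        result.append(filled)
    return result

# Hypothetical listings for illustration only.
listings = [
    {"beds": 2, "baths": 1, "hood": "Capitol Hill", "sqft": 700},
    {"beds": 2, "baths": 1, "hood": "Capitol Hill", "sqft": 800},
    {"beds": 4, "baths": 4, "hood": "Magnolia", "sqft": 1900},
    {"beds": 4, "baths": 4, "hood": "Magnolia", "sqft": 2100},
    {"beds": 2, "baths": 1, "hood": "Capitol Hill", "sqft": None},
    {"beds": 4, "baths": 4, "hood": "Magnolia", "sqft": None},
    {"beds": 2, "baths": 1, "hood": "Ballard", "sqft": None},
    {"beds": 3, "baths": 2, "hood": "Ballard", "sqft": None},
]
key_sets = [("beds", "baths", "hood"), ("beds", "baths"), ("beds",)]
filled = impute_with_fallback(listings, key_sets)
for row in filled:
    print(row)
```

Here the Ballard two-bedroom has no neighborhood match, so it borrows the bedrooms-and-bathrooms average, and the three-bedroom listing matches no group at all and receives the overall mean.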