Interview Query FANG Question2

This document discusses how to deal with missing square footage data in a dataset of 100K housing listings that is being used to build a model to predict housing prices in Seattle. 20% of the listings are missing square footage data. The document recommends either dropping the missing data if model accuracy is not significantly reduced, or using imputation methods like taking the mean or median square footage or using a nearest neighbors approach to estimate square footage based on other listing features. More advanced imputation could estimate square footage based on number of bedrooms, bathrooms, and neighborhood to better approximate missing values.

Uploaded by

Zheng Yuxiang


https://app.interviewquery.com/questions/missing-housing-data

MISSING HOUSING DATA

Interview Query

Question:

We want to build a model to predict housing prices in the city of Seattle. We've scraped 100K sold listings over
the past three years but found that around 20% of the listings are missing square footage data.

How do we deal with the missing data to construct our model?

Solution:

This is a classic modeling interview question. Data cleanliness is a well-known issue in most datasets
used to build models. Real-life data is messy, often incomplete, and almost always needs to be wrangled.

The key to answering this interview question is to probe and ask questions to learn more about the specific
context. For example, we should clarify if there are any other features missing data in the listings.

If square footage is the only column with missing data, we can train models on progressively larger subsets
of the roughly 80% of listings that do have square footage and see what the learning curve looks like. If a
model trained on 60% of the dataset is only slightly less accurate than one trained on the full 80%, then,
depending on how much accuracy we can afford to give up, we may be able to just drop the listings with
missing square footage. That result would also point to a larger feature selection problem, since growing
the training data by a third (from 60% to 80% of the dataset) barely improves model accuracy.
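The learning-curve check above can be sketched roughly as follows. This is a minimal illustration, not the original author's code: the data is synthetic, the model is a one-feature least-squares line (price against square footage), and the names fit_line, mean_abs_error, and learning_curve are hypothetical helpers. Fractions are taken of the complete rows, so 0.75 of the complete rows corresponds to about 60% of the full dataset.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one feature)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - slope * mx, slope


def mean_abs_error(model, xs, ys):
    intercept, slope = model
    return statistics.fmean(abs(y - (intercept + slope * x)) for x, y in zip(xs, ys))


def learning_curve(rows, fractions, holdout=0.2, seed=0):
    """Train on growing fractions of the rows; score on one fixed hold-out set."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * holdout)
    test, pool = rows[:n_test], rows[n_test:]
    test_x, test_y = [r[0] for r in test], [r[1] for r in test]
    curve = {}
    for frac in fractions:
        train = pool[: max(2, int(len(pool) * frac))]
        model = fit_line([r[0] for r in train], [r[1] for r in train])
        curve[frac] = mean_abs_error(model, test_x, test_y)
    return curve


# Synthetic stand-in for the listings that DO have square footage (assumption).
rng = random.Random(42)
complete_rows = []
for _ in range(2000):
    sqft = rng.uniform(400, 3500)
    price = 50_000 + 300 * sqft + rng.gauss(0, 40_000)
    complete_rows.append((sqft, price))

curve = learning_curve(complete_rows, fractions=[0.25, 0.5, 0.75, 1.0])
for frac, err in curve.items():
    print(f"{frac:.0%} of complete rows -> hold-out MAE ${err:,.0f}")
```

If the error is nearly flat between the 0.75 and 1.0 points, dropping the incomplete listings is unlikely to cost much accuracy for this model.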

The second common method is imputation. Imputation can be done with different methods and algorithms,
but at its core it is the process of filling in missing data with estimates. If we have found that the model
does not validate well when the incomplete listings are excluded, we can try different imputation
techniques and cross-validate our models against each technique to figure out which works best.
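One way to compare imputation techniques by cross-validation is sketched below. It is an illustration under stated assumptions: the listings are synthetic, the price model is a one-feature least-squares line, and cv_error and fit_line are hypothetical helper names. The fill value is computed on each training fold only, so nothing leaks from the validation fold.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one feature)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - slope * mx, slope


def cv_error(rows, fill_stat, k=5, seed=0):
    """Mean absolute price error over k folds, imputing sqft with fill_stat.

    rows is a list of (sqft_or_None, price) pairs.
    """
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    fold_errors = []
    for i in range(k):
        valid = rows[i::k]
        train = [r for j, r in enumerate(rows) if j % k != i]
        # Fit the imputation value on the training fold only.
        fill = fill_stat([s for s, _ in train if s is not None])
        train_x = [fill if s is None else s for s, _ in train]
        intercept, slope = fit_line(train_x, [p for _, p in train])
        errors = [
            abs(p - (intercept + slope * (fill if s is None else s)))
            for s, p in valid
        ]
        fold_errors.append(statistics.fmean(errors))
    return statistics.fmean(fold_errors)


# Synthetic listings with roughly 20% of square footage missing (assumption).
rng = random.Random(7)
listings = []
for _ in range(1000):
    sqft = rng.lognormvariate(7.2, 0.4)
    price = 100_000 + 350 * sqft + rng.gauss(0, 50_000)
    listings.append((None if rng.random() < 0.2 else sqft, price))

strategies = {"mean": statistics.fmean, "median": statistics.median}
scores = {name: cv_error(listings, stat) for name, stat in strategies.items()}
for name, err in scores.items():
    print(f"{name} imputation -> cross-validated MAE ${err:,.0f}")
```

Any other imputation technique can be added to the comparison as long as it can be fit on the training fold and applied to the validation fold.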

The simplest imputation method for a continuous variable such as square footage is to fill in every missing
value with the mean or median of the observed distribution. The downsides of this approach are that it
ignores correlation between features and does not account for uncertainty in the imputed values. It would
be ridiculous if a studio condo had the same square footage as a five-bedroom home.
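A small sketch makes both downsides concrete. The listings here are made up for illustration, and fill_missing is a hypothetical helper, not part of any library.

```python
import statistics


def fill_missing(values, stat=statistics.median):
    """Replace every None with one summary statistic of the observed values."""
    observed = [v for v in values if v is not None]
    fill = stat(observed)
    return [fill if v is None else v for v in values]


# Hypothetical (bedrooms, sqft) listings; 0 bedrooms stands for a studio.
listings = [(0, 450), (0, None), (2, 900), (3, 1500), (5, 3200), (5, None)]
sqft = [s for _, s in listings]

by_median = fill_missing(sqft)
by_mean = fill_missing(sqft, stat=statistics.fmean)

# The studio (index 1) and the five-bedroom home (index 5) get the same value.
print(by_median[1], by_median[5])  # 1200.0 1200.0

# Mean-filling adds points at the centre of the distribution, so the
# sample standard deviation after filling is smaller than before.
observed = [s for s in sqft if s is not None]
print(statistics.stdev(observed), statistics.stdev(by_mean))
```

The shrinking spread is one sense in which a single fill value fails to account for uncertainty: the imputed column looks more certain than the observed data justifies.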

To solve that problem, a second, more advanced method is a simple nearest-neighbors-style approach that
approximates square footage by grouping listings on categorical features. What if we computed means
from different subsets of the housing dataset defined by other features? If we took the average square
footage of listings grouped by number of bedrooms, we could impute one average square footage for a
studio and a different one for a five-bedroom home.
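Grouping by bedroom count could look roughly like this. The listings are invented, impute_by_group is a hypothetical helper, and the sketch assumes at least one listing has an observed square footage.

```python
from collections import defaultdict
import statistics


def impute_by_group(rows, key="bedrooms", target="sqft"):
    """Fill missing target values with the mean of rows sharing the same key."""
    observed = defaultdict(list)
    for row in rows:
        if row[target] is not None:
            observed[row[key]].append(row[target])
    group_mean = {g: statistics.fmean(vals) for g, vals in observed.items()}
    # Fallback for groups with no observed value at all.
    overall = statistics.fmean(v for vals in observed.values() for v in vals)

    filled = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left untouched
        if new_row[target] is None:
            new_row[target] = group_mean.get(new_row[key], overall)
        filled.append(new_row)
    return filled


# Hypothetical listings; 0 bedrooms stands for a studio.
listings = [
    {"bedrooms": 0, "sqft": 450},
    {"bedrooms": 0, "sqft": 500},
    {"bedrooms": 0, "sqft": None},
    {"bedrooms": 5, "sqft": 3000},
    {"bedrooms": 5, "sqft": 3400},
    {"bedrooms": 5, "sqft": None},
]

filled = impute_by_group(listings)
print(filled[2]["sqft"])  # 475.0, the studio average
print(filled[5]["sqft"])  # 3200.0, the five-bedroom average
```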


Taking it a step further, we can refine this nearest-neighbor approach by grouping on multiple categorical
features, with how many depending on the size of the dataset. If we took the average of the existing
square footage values for each combination of number of bedrooms, number of bathrooms, and
neighborhood, we would get an even better approximation of the missing square footage. An example
scenario would be a two-bedroom, one-bath condo in Capitol Hill averaging 750 square feet versus a
four-bedroom, four-bath house in Magnolia averaging 2,000 square feet.
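A sketch of the multi-feature version follows. The fallback order and the minimum group size are assumptions added for illustration (the text only says the grouping depends on dataset size), the listings are invented, and KEY_LEVELS, build_group_means, and impute_sqft are hypothetical names.

```python
from collections import defaultdict
import statistics

# Most specific grouping first; each later level is a coarser fallback.
KEY_LEVELS = [
    ("bedrooms", "bathrooms", "neighborhood"),
    ("bedrooms", "bathrooms"),
    ("bedrooms",),
    (),  # overall mean of all observed square footage
]


def build_group_means(rows, min_size=2):
    """For each level, average sqft per group, keeping groups with enough rows."""
    tables = []
    for keys in KEY_LEVELS:
        buckets = defaultdict(list)
        for row in rows:
            if row["sqft"] is not None:
                buckets[tuple(row[k] for k in keys)].append(row["sqft"])
        means = {
            group: statistics.fmean(vals)
            for group, vals in buckets.items()
            if len(vals) >= min_size
        }
        tables.append((keys, means))
    return tables


def impute_sqft(row, tables):
    """Return the observed sqft, or the mean of the most specific usable group."""
    if row["sqft"] is not None:
        return row["sqft"]
    for keys, means in tables:
        group = tuple(row[k] for k in keys)
        if group in means:
            return means[group]
    return None  # too few observed values to impute anything


def listing(bedrooms, bathrooms, neighborhood, sqft):
    return {"bedrooms": bedrooms, "bathrooms": bathrooms,
            "neighborhood": neighborhood, "sqft": sqft}


rows = [
    listing(2, 1, "Capitol Hill", 700),
    listing(2, 1, "Capitol Hill", 800),
    listing(2, 1, "Capitol Hill", None),
    listing(2, 1, "Fremont", 900),
    listing(2, 1, "Ballard", None),   # no Ballard data: falls back a level
    listing(4, 4, "Magnolia", 1900),
    listing(4, 4, "Magnolia", 2100),
    listing(4, 4, "Magnolia", None),
]

tables = build_group_means(rows)
imputed = [impute_sqft(r, tables) for r in rows]
print(imputed[2])  # 750.0, from 2 bed / 1 bath / Capitol Hill
print(imputed[4])  # 800.0, from all 2 bed / 1 bath listings
print(imputed[7])  # 2000.0, from 4 bed / 4 bath / Magnolia
```

The minimum group size guards against imputing from a single listing. On a smaller dataset, fewer combinations reach that size and more rows fall back to the coarser groupings.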
