KC Housing Assignment Fall 2024

The assignment for MBAN 5520 involves using a dataset from King County, Washington, to practice building predictive models using Excel and Python. Students are required to explore data, create histograms and correlation tables, and analyze relationships between property prices and various predictors, particularly focusing on sqft_living. The assignment emphasizes the challenges of model building, such as data imbalance and multicollinearity, while also requiring students to document their learning process and insights gained from the analysis.

Uploaded by

Isabella Ninja

MBAN 5520 Statistics and Predictive Analytics Fall 2024

Assignment #1
Due: Tuesday, December 2nd at 11:59pm

The term is almost over so there will only be one “assignment”. The purpose is primarily to get
you to practice use of tools and understanding results. In this assignment, I also want to have
you experience the wandering pathway you may follow in building a predictive model. The data
we will use is a popular dataset that you can find on Kaggle. You can also find numerous data
analyses and models for this dataset on Kaggle and elsewhere.

https://www.kaggle.com/datasets/astronautelvis/kc-house-data

The data set contains information about 20 characteristics of 21,613 properties. The data is from
King County in Washington. You may find it useful to actually look at the area on a map. The
data set has the latitude and longitude as well as zip code for each property. For example, I
googled the zip code for some of the most expensive properties and got this map.
https://www.unitedstateszipcodes.org/98004/

In the assignment, I have asked you to do a variety of tasks using Excel. Later I have asked you
to repeat some of this work using Python. Having done it first in Excel will give you a sense of
what the Python output should look like (so you will know if Python did what you thought it
should do). For the Python, I ask that you use a prompt to AI (Co-pilot, ChatGPT, or any other
LLM that you choose). I have asked to see your prompt, the Python code and the output. If
your attempt yields different results from Excel, I ask that you submit your original
prompt and commentary on why the results are not what you expected. I want to see your
learning process. In the future you will use AI to write code and the challenge will be knowing
when the outcome is what you thought you asked for. Don’t give me your corrected version – I
am sure you will figure it out.

With Excel, I recommend that for each question, you copy the necessary data into
a new worksheet and label the sheet with the question number. Put the output of
any analyses on the same page. Do not do your work directly on the data page,
since you risk accidentally corrupting the data.

Print/paste your answers directly into this Word file. It makes it easier for me to grade
when I can see the question you are responding to.

1. Begin by exploring the data.


a. Construct a histogram of price. Paste a picture below.

Observations from the Histogram


Shape and Skewness:
The histogram shows that most property prices are clustered at the lower end, creating a
right-skewed shape.
There’s a long tail on the right side, which represents a small number of very expensive
properties.
Median Home Value from ZIP Code 98004:
A red dashed line marks the median home value of $2,240,226 for ZIP Code 98004.
This median is much higher than most property prices, showing that homes in ZIP Code
98004 are generally more expensive.
Comparison to Excel Data:
Both Excel files (KC Housing.xlsx and kc_final.csv) show a wide range of property
prices, with many homes priced under $500,000.
The histogram confirms this, showing that lower and mid-range prices dominate the
dataset.
Frequency of High-Value Homes:
The histogram highlights a small number of homes priced above $2,000,000, which
likely belong to upscale areas like ZIP Code 98004.
Comparison of Python Output and Excel Data
Excel Analysis:
Looking at the price data in Excel, most properties are on the lower end, with a few
outliers at very high prices. This matches the pattern in the histogram.
Python Histogram:
Python's histogram provides a clear visual of the price distribution and includes the
median value for ZIP Code 98004, adding helpful context.

Insights from Combined Analysis:


Including ZIP Code 98004 data shows the clear difference between typical homes and
high-end properties in that area.
Both Excel and Python confirm a wide price range, but Python's chart makes it easier to
see trends and understand the effect of expensive homes.
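
The right-skew described above can also be checked numerically rather than by eye. The following is a minimal sketch, assuming NumPy is available; the prices are a synthetic lognormal stand-in (not the King County data), and the bin count of 20 is an arbitrary choice.

```python
import numpy as np

# Synthetic stand-in for the `price` column (NOT the real King County data):
# a lognormal draw produces the same kind of right-skewed shape.
rng = np.random.default_rng(0)
price = rng.lognormal(mean=13.0, sigma=0.5, size=5_000)

# Bin the prices the way a histogram would; counts[i] is the number of
# values falling between edges[i] and edges[i + 1].
counts, edges = np.histogram(price, bins=20)

# Two quick numeric signs of right skew: the mean sits above the median,
# and the sample skewness is positive.
mean_price = price.mean()
median_price = np.median(price)
skewness = np.mean((price - mean_price) ** 3) / price.std() ** 3

print(f"mean={mean_price:,.0f}  median={median_price:,.0f}  skew={skewness:.2f}")
for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"{lo:>12,.0f} to {hi:>12,.0f}: {c}")
```

With the real data, `price` would be read from the worksheet or CSV instead of generated; the bin counts can then be compared directly with the Excel histogram.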

b. Construct a correlations table that includes price and all predictor variables.
Exclude ID and date. Date is non-numeric and Excel will complain. Paste a
picture below.

Description of the Correlation Chart


Observations from the Chart:
1. Correlation Strength with Price:
The chart displays the correlation coefficient of each predictor variable
with the property price.
Variables such as sqft_living, grade, and sqft_above show the strongest positive
correlations with price.
Other variables, like condition and bathrooms, also have moderate positive
correlations.
2. Weaker Correlations:
Variables like floors and yr_built show weaker correlations with price.
Predictors that are only weakly correlated with price may add little predictive
power to a price model.
3. Insights from ZIP Code 98004:
ZIP Code 98004 has a median home value of $2,240,226, which reflects a strong
influence of factors such as home size (sqft_living) and grade (quality of
construction and design).
High-value properties in this ZIP code likely contribute to the strong correlations
observed for variables like sqft_living and grade.

Comparison with Excel Data:


1. Excel Data Observations:
Manually inspecting the Excel files confirms that properties with higher values
of sqft_living, grade, and sqft_above tend to have higher prices.
Summary statistics and scatter plots in Excel align with the Python chart,
where these variables show the strongest positive correlations.
2. Correlation Table from Excel:
When calculating correlations manually in Excel, the results match the Python
outputs, with sqft_living and grade consistently at the top of the list.

Insights from Combined Analysis:


The chart shows that size (sqft_living), grade, and above-ground square footage
(sqft_above) are the predictors most strongly correlated with property prices.
Including insights from ZIP Code 98004, we see that these factors align with the
characteristics of high-value properties, reinforcing the importance of these predictors.
Python provides a clear visualization of the correlation strengths, which complements the
numerical correlation table created in Excel.
By combining the Python output, Excel analysis, and demographic/real estate insights from ZIP
Code 98004, the chart highlights the factors most strongly associated with property prices in the
dataset.
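
A correlation table like the one requested in this question can be reproduced in a few lines. This is a minimal sketch under stated assumptions: the three columns are synthetic stand-ins that borrow column names from the assignment, and the relationship between them is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2_000

# Synthetic stand-ins for three columns named in the assignment.
sqft_living = rng.normal(2_000, 600, n)
yr_built = rng.integers(1900, 2015, n).astype(float)
price = 250 * sqft_living + rng.normal(0, 150_000, n)  # toy pricing rule

names = ["price", "sqft_living", "yr_built"]
data = np.column_stack([price, sqft_living, yr_built])

# rowvar=False treats each column as one variable, which matches the
# layout of a worksheet with one variable per column.
corr = np.corrcoef(data, rowvar=False)

# Rank the predictors by the absolute size of their correlation with price.
ranking = sorted(zip(names[1:], corr[0, 1:]), key=lambda pair: -abs(pair[1]))
for name, r in ranking:
    print(f"{name:<12} {r:+.3f}")
```

Sorting by absolute value keeps strongly negative predictors near the top, which a plain descending sort would bury.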

c. The largest correlation is with sqft_living. Construct a histogram of
sqft_living. Paste a picture below.

Python Histogram Observations:


The histogram of sqft_living (pink bars) displays the distribution of living space sizes:
The majority of homes have sqft_living values concentrated around 1,000–3,000 square
feet. A long tail is visible, representing larger homes with living spaces exceeding 4,000
square feet.

Generated Excel File (sqft_living_histogram_data.xlsx):


The file contains a consolidated list of sqft_living values from both provided Excel files (KC
Housing.xlsx and kc_final.csv).
Manually inspecting this file will show the unified sqft_living column, ensuring no data overlap
or omission.
Comparison with Original Excel Files:
1. KC Housing.xlsx:
Contains sqft_living data with a wide range of values.
The most frequent values (e.g., between 1,000 and 3,000 square feet) in the file align with the
peak of the histogram in Python.
2. kc_final.csv:
Similarly contains sqft_living data, with a significant number of entries in the
mid-range and fewer outliers.
The histogram's long tail corresponds to the high-end values found in this file,
such as homes in affluent areas like ZIP Code 98004.
Insights from ZIP Code 98004:
ZIP Code 98004 homes tend to be larger, often exceeding 4,000 square feet of living
space. This aligns with the tail end of the histogram.

The higher frequency of mid-range sqft_living values reflects the inclusion of properties
from less affluent areas.

Overall Observations:
The Python histogram effectively visualizes the distribution of sqft_living, with results
matching the data from both original Excel files and the combined Excel file.
The histogram emphasizes mid-range properties (1,000–3,000 sqft), with the tail
confirming the influence of high-end properties like those in ZIP Code 98004.

d. Construct a scatter chart of price versus sqft_living. Paste a picture below.

Comparison of the Python Code Output, Generated Excel File, Original Excel Files, and
ZIP Code Data
1. Python Scatter Plot Observations:
The scatter plot shows a clear positive relationship between sqft_living (home size)
and price.
Smaller homes (1,000–2,000 sqft) typically cluster in the lower price range
($200,000–$500,000).
Larger homes (above 4,000 sqft) extend into the higher price ranges (above
$1,000,000), with a few extreme outliers.
2. Generated Excel File (price_vs_sqft_living_data.xlsx):
This file combines the price and sqft_living columns from both original datasets.
Manual inspection shows that data points are consistent with the scatter plot, confirming
the observed trends.
Larger values of sqft_living correspond to higher prices, reflecting the same relationship
seen in the scatter plot.
3. Comparison with Original Excel Files:
KC Housing.xlsx:
Contains a wide range of properties, with many homes in the lower price range
and sqft_living values between 1,000 and 3,000 sqft.
Aligns with the scatter plot's clustering of smaller properties in the lower-left
region.
kc_final.csv:
Includes additional high-value homes, contributing to the scatter plot’s upper-right
tail, where prices exceed $2,000,000 and sqft_living is over 4,000 sqft.
4. Insights from ZIP Code 98004:
Homes in ZIP Code 98004 are generally larger and more expensive, as reflected in the
plot’s upper-right corner.
Median Home Value: $2,240,226 suggests that many properties in this ZIP code
contribute to the outliers in the scatter plot.
Size of Homes: Many homes in this area have sqft_living values exceeding 4,000
sqft, which aligns with the high sqft_living and price points in the scatter plot.

Combined Analysis:
The scatter plot and the generated Excel file confirm that:
1. Smaller homes dominate the dataset and cluster in the lower price range,
consistent across all sources.
2. Larger, high-value properties (e.g., in ZIP Code 98004) are outliers and contribute
to the plot's tail.
Insights from ZIP Code 98004 help explain the presence of high-priced, large homes in
the data.

e. What are your initial thoughts on challenges in building a model to predict
price using the available predictor variables?

1. Comparison with Original Excel Files:


KC Housing.xlsx:
Price and predictor variables align with the concentration of mid-range properties.
Smaller homes with fewer bedrooms and bathrooms dominate the dataset,
consistent with the histogram’s left side.
kc_final.csv:
Contains a broader range of properties, including more high-end homes, reflected
in the histogram’s tail.
High values for sqft_living, sqft_above, and grade are strongly associated with
high prices.
2. Insights from ZIP Code 98004:
Homes in ZIP Code 98004 are typically larger and more expensive, contributing to the
histogram’s right tail.
Median Home Value: $2,240,226 reflects the influence of high-end homes in this
ZIP code.
Predictors for High Prices: Larger sqft_living, higher grade, and more
bathrooms are key factors, aligning with patterns in the data.

Challenges in Model Building Based on the Comparison:


1. Imbalance in Data:
The dataset is heavily skewed towards lower-priced homes, with few high-priced
outliers. This can make the model less accurate for predicting expensive
properties.
2. Multicollinearity:
Variables like sqft_living, sqft_above, and sqft_basement are closely related,
leading to potential redundancy.
3. Outliers:
High-value properties in areas like ZIP Code 98004 may disproportionately
influence the model, leading to overfitting.
4. Lack of Contextual Variables:
Factors such as location quality or school districts, which are critical to real estate
pricing, are missing. ZIP Code 98004 highlights the importance of such contextual
factors.

Combined Analysis:
The Python histogram, generated Excel file, and insights from ZIP Code 98004
collectively confirm the challenges in creating a robust predictive model:
1. Skewed price distribution.
2. Multicollinearity among predictors.
3. Lack of variables capturing location-based value differences.
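
One common way to put a number on the multicollinearity concern is the variance inflation factor (VIF): regress each predictor on the others and compute 1 / (1 - R-squared). The sketch below is illustrative only. It uses synthetic columns, and the helper names `r_squared` and `vif` are introduced here, not taken from the assignment.

```python
import numpy as np

def r_squared(y, X):
    """R-squared from a least-squares fit of y on X plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - resid.var() / y.var()

def vif(X):
    """Variance inflation factor for each column of X."""
    factors = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        factors.append(1.0 / (1.0 - r_squared(X[:, j], others)))
    return np.array(factors)

# Synthetic stand-in data: living area is above-ground plus basement area.
rng = np.random.default_rng(2)
n = 2_000
sqft_above = rng.normal(1_800, 500, n)
sqft_basement = np.abs(rng.normal(0, 300, n))
sqft_living = sqft_above + sqft_basement
bathrooms = rng.normal(2.0, 0.7, n)  # unrelated to size in this toy data

X = np.column_stack([sqft_living, sqft_above, bathrooms])
vifs = vif(X)
for name, v in zip(["sqft_living", "sqft_above", "bathrooms"], vifs):
    print(f"{name:<12} VIF = {v:.2f}")
```

A frequently quoted rule of thumb treats VIF values above roughly 5 to 10 as a warning sign, although the threshold is a convention rather than a test. Note that including sqft_living, sqft_above, and sqft_basement together would make the VIF undefined, because one column is the exact sum of the other two.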

2. Let us build a model to predict price using the 5 variables with the strongest correlation
with price (sqft_living, grade, sqft_above, sqft_living15, bathrooms). Ask for residuals
and residual plots with your regression(s).

a. Paste a picture of the Summary Output below.

Description of the Outputs


1. Residual Plot
The residual plot shows the relationship between the fitted (predicted) prices and the residuals
(errors between the actual and predicted prices):
Horizontal Line (y=0):
The red dashed line at y=0 represents no error. Residuals scatter around this line.
Observations:
Residuals appear to fan out as the predicted price increases,
indicating heteroscedasticity (variance of residuals increases with fitted values).
Some extreme residuals (outliers) are present, suggesting the model struggles to
predict prices for some properties (likely high-value homes or outliers like those
in ZIP Code 98004).

2. Regression Summary
The regression model uses five variables (sqft_living, grade, sqft_above, sqft_living15,
and bathrooms) to predict price. Key insights:
Model Fit:
R-squared: 0.544 (54.4% of price variance is explained by the model).
Adjusted R-squared: Also 0.544, indicating a consistent fit.
Coefficients:
sqft_living: A coefficient of 245.42, meaning each additional square foot is
associated with a predicted price about $245 higher, holding the other predictors constant.
grade: A coefficient of 111,000. Higher grades (better design and quality)
significantly increase price.
sqft_above: A negative coefficient (-80.48), likely due to multicollinearity
with sqft_living.
sqft_living15: A coefficient of 22.82. The size of neighboring homes has a
smaller positive influence.
bathrooms: A surprising negative coefficient (-35,460), which may indicate
interaction effects or multicollinearity.
Intercept:
The intercept of -646,900 reflects the price baseline when all predictors are zero
(not practically meaningful but required for the model).
Significance:
All predictors are statistically significant (p-value < 0.05), meaning they
contribute meaningfully to the model.
Diagnostics:
Omnibus Test and Jarque-Bera (JB): Indicate non-normality in residuals
(likely due to outliers and skewness in the data).
Condition Number: 29,500, suggesting potential multicollinearity.

Combined Analysis
1. Strengths:
The model captures important predictors like sqft_living and grade that strongly
influence price.
Explains a substantial proportion of variance in price (54.4%).
2. Weaknesses:
Presence of heteroscedasticity and outliers affects prediction accuracy, especially
for high-value properties.
Multicollinearity among variables (e.g., sqft_living and sqft_above) may distort
coefficient interpretations.
3. Insights from ZIP Code 98004:
High-value properties with large square footage and high grades likely contribute
to outliers and residual patterns in the model.
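
The fan-shaped residual pattern described above can be reproduced and checked on toy data. This is a rough sketch assuming NumPy; the data are synthetic, with noise deliberately built to grow with house size, and the split-in-half comparison is a crude stand-in for a formal heteroscedasticity test.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 3_000
sqft_living = rng.uniform(500, 5_000, n)
grade = rng.integers(4, 13, n).astype(float)

# Toy pricing rule (an assumption): the noise spread grows with house
# size, which is what produces the fan shape in a residual plot.
noise = rng.normal(0, 1, n) * 40 * sqft_living
price = -600_000 + 200 * sqft_living + 100_000 * grade + noise

# Ordinary least squares with an intercept column.
X = np.column_stack([np.ones(n), sqft_living, grade])
coef, *_ = np.linalg.lstsq(X, price, rcond=None)
fitted = X @ coef
residuals = price - fitted

r2 = 1.0 - np.sum(residuals**2) / np.sum((price - price.mean()) ** 2)

# Crude check: compare the residual spread for the lower and upper
# halves of the fitted values.
order = np.argsort(fitted)
low_sd = residuals[order[: n // 2]].std()
high_sd = residuals[order[n // 2:]].std()

print(f"R-squared = {r2:.3f}")
print(f"residual SD, lower half of fitted values: {low_sd:,.0f}")
print(f"residual SD, upper half of fitted values: {high_sd:,.0f}")
```

A clearly larger spread in the upper half is the numeric counterpart of residuals fanning out to the right in the plot.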

b. Write out the estimated regression equation.

1. Generated Excel File (Regression Coefficients):


Variable         Coefficient
Intercept        -646,863.75
Sqft_Living           245.42
Grade             111,024.92
Sqft_Above            -80.48
Sqft_Living15          22.82
Bathrooms         -35,464.02

The Excel file contains the coefficients derived from the regression model:
Key Observations:
Sqft_Living and Grade have the strongest positive influence on price, as also
reflected in the dataset where larger, high-quality homes command higher prices.
Sqft_Above has a negative coefficient, likely due to multicollinearity
with Sqft_Living.
Bathrooms has a surprising negative coefficient, suggesting it interacts with other
variables.
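
Written out from the coefficients in the table above, the estimated equation is: predicted price = -646,863.75 + 245.42(sqft_living) + 111,024.92(grade) - 80.48(sqft_above) + 22.82(sqft_living15) - 35,464.02(bathrooms). The short sketch below applies it to one house. The house values are hypothetical and chosen only for illustration, and the function name `predict_price` is introduced here.

```python
# Coefficients as reported in the regression output above.
INTERCEPT = -646_863.75
COEFFICIENTS = {
    "sqft_living": 245.42,
    "grade": 111_024.92,
    "sqft_above": -80.48,
    "sqft_living15": 22.82,
    "bathrooms": -35_464.02,
}

def predict_price(house):
    """Apply the fitted linear equation to one house (dict of predictor values)."""
    return INTERCEPT + sum(COEFFICIENTS[name] * house[name] for name in COEFFICIENTS)

# Hypothetical house, made up purely for illustration.
example_house = {
    "sqft_living": 2_000,
    "grade": 7,
    "sqft_above": 1_500,
    "sqft_living15": 1_800,
    "bathrooms": 2,
}

predicted = predict_price(example_house)
print(f"Predicted price: ${predicted:,.2f}")
```

For this made-up house the equation gives roughly $470,600. Setting every predictor to zero returns the intercept, which is why the intercept has no practical meaning on its own.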

2. Python Histogram (Residuals):

The histogram of residuals shows that:


Most residuals are centered around zero, indicating that the model generally
predicts prices well for the majority of properties.
The tails indicate outliers, with larger residuals for properties that deviate
significantly from the model's predictions.

Insights from Histogram:


The right-skewed residuals reflect the challenges in predicting high-priced homes,
particularly those in areas like ZIP Code 98004.
The histogram aligns with the skewed price distribution observed in the dataset.

3. Original Excel Files:


KC Housing.xlsx:
Most properties are mid-range, with smaller homes and moderate grades
dominating the dataset.
The coefficients for Sqft_Living and Grade align with this, as properties with
higher values for these variables tend to have higher prices.
kc_final.csv:
Includes more outliers and high-value homes, such as those with
large Sqft_Living and high Grade.
These high-value properties contribute to the residuals seen in the histogram, as
the model struggles to predict their prices accurately.

4. Insights from ZIP Code Data (98004):


ZIP Code 98004 is known for high-value properties:
Median Home Value: $2,240,226, far above the average.
Sqft_Living: Larger homes contribute to the high prices, consistent with the
strong coefficient for Sqft_Living.
Grade: Homes in this ZIP code are typically high-grade, matching the strong
positive coefficient for Grade.
Residual Impact:
Properties in ZIP Code 98004 likely contribute to the outliers in the residual plot
due to their high prices and unique characteristics.

Combined Analysis:
1. Excel File (Coefficients):
Highlights the key predictors (Sqft_Living and Grade) that align with patterns in
both datasets and ZIP Code 98004 properties.
Suggests challenges in accurately predicting prices for high-value properties,
reflected in the residual histogram.
2. Histogram:
Captures the difficulty in modeling skewed distributions, particularly for
high-priced homes in affluent areas like ZIP Code 98004.
3. Original Datasets:
Confirm the dominance of mid-range properties and provide context for outliers.
4. ZIP Code Data:
Reinforces the influence of Grade and Sqft_Living on price in high-value
locations.

c. Are there any variables that do not appear to be useful in the model? Justify
your answer.
Analysis of Variables in the Model
To determine if any variables in the model are not useful for predicting price, we evaluate:
1. Statistical Significance:
Variables with high p-values (greater than 0.05) are considered statistically
insignificant and may not contribute meaningfully to the model.
2. Impact on the Model:
Variables with very small coefficients or unexpected coefficients (e.g., negative
relationships for typically positive predictors) may not be practically meaningful.

Variable Significance from Regression Results


Regression Coefficients:
Variable         Coefficient    p-value   Interpretation
Intercept        -646,863.75    0.000     Baseline price when all predictors are zero.
Sqft_Living           245.42    0.000     Strong positive impact, highly significant.
Grade             111,024.92    0.000     Strong positive impact, highly significant.
Sqft_Above            -80.48    0.000     Significant, but negative relationship.
Sqft_Living15          22.82    0.000     Weak positive impact, significant.
Bathrooms         -35,464.02    0.000     Significant, but unexpected negative impact.

Variables That May Not Be Useful


1. Bathrooms:
The coefficient is negative (-35,464.02), suggesting that additional bathrooms
decrease the price, which is counterintuitive.
While statistically significant (p-value < 0.05), the negative sign may indicate
multicollinearity or interaction effects with other variables.
2. Sqft_Above:
The negative coefficient (-80.48) is likely due to multicollinearity
with Sqft_Living, as both measure overlapping aspects of a home’s size.
Although statistically significant, its unique contribution to predicting price is
questionable.
3. Sqft_Living15:
While positively related to price, its coefficient (22.82) is much smaller
than Sqft_Living or Grade, indicating a weaker influence.
It remains statistically significant but has a relatively minor impact on the model.

Justification of Findings
Multicollinearity:
Multicollinearity between Sqft_Living, Sqft_Above, and Bathrooms likely distorts the
coefficients, causing unexpected negative relationships.
Sqft_Living is already a strong predictor, making Sqft_Above redundant.
Practical Interpretation:
Bathrooms and Sqft_Living15 have weak or unexpected coefficients, reducing their
practical usefulness despite being statistically significant.

Conclusion:
1. Key Predictors: Sqft_Living and Grade are the strongest predictors in the model.
2. Potentially Less Useful Variables: Bathrooms, Sqft_Above, and Sqft_Living15 may not
add significant value due to multicollinearity and weaker practical impact.

d. Do any variables have negative coefficients? If so, explain why the
coefficient(s) are negative.
Variables with Negative Coefficients in the Model
Regression Coefficients Recap:
Variable         Coefficient    Interpretation
Intercept        -646,863.75    Baseline price when all predictors are zero.
Sqft_Living           245.42    Positive impact, significant.
Grade             111,024.92    Positive impact, significant.
Sqft_Above            -80.48    Negative impact, significant.
Sqft_Living15          22.82    Positive impact, significant.
Bathrooms         -35,464.02    Negative impact, significant.

Variables with Negative Coefficients:


1. Sqft_Above (-80.48):
Reason for Negativity:
Multicollinearity: Sqft_Above overlaps with Sqft_Living as both
measure aspects of home size. Sqft_Living already captures the overall
square footage, making Sqft_Above redundant and causing the negative
coefficient.
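Interpretation: Because Sqft_Living already counts above-ground space, the
Sqft_Above coefficient describes what happens when total living area is held
fixed and a square foot is shifted from the basement to above ground. A negative
value is therefore not evidence that above-ground space lowers price.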
2. Bathrooms (-35,464.02):
Reason for Negativity:
Multicollinearity: The number of bathrooms is often correlated with other
predictors, such as Sqft_Living and Grade. If the effect of these variables
is already captured, the coefficient for Bathrooms might become negative.
Interaction Effects: The negative coefficient could also result from
interaction effects where additional bathrooms are associated with homes
in specific categories (e.g., smaller homes where extra bathrooms may not
add value).
Interpretation: The unexpected negative relationship suggests that the
variable's influence is not direct and might be better understood with
additional contextual variables (e.g., home layout or luxury amenities).

Why Negative Coefficients Occur in Regression Models:


1. Multicollinearity:
When two predictors measure overlapping effects
(e.g., Sqft_Living and Sqft_Above), one variable may "compensate" for the
other's influence, leading to unexpected negative coefficients.
2. Interaction Effects:
Variables may interact in ways not explicitly captured in a simple linear
regression model, causing unexpected signs.
3. Data Distribution:
If certain variables are concentrated in outlier regions (e.g., high-value homes
with unusual features), coefficients may reflect indirect relationships.

Conclusion:
The negative coefficients for Sqft_Above and Bathrooms highlight the impact
of multicollinearity and interaction effects.
These variables may not independently contribute to price prediction and could be
reevaluated for inclusion in the model.
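
The way overlapping predictors can flip a sign is easy to demonstrate on noise-free toy data. In the sketch below, which assumes NumPy, every square foot adds value (both toy rates are positive and are invented for illustration), yet the fitted coefficient on sqft_above comes out negative once sqft_living is also in the model.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500
sqft_above = rng.uniform(800, 3_000, n)
sqft_basement = rng.uniform(0, 1_200, n)
sqft_living = sqft_above + sqft_basement

# Toy pricing rule (an assumption, not estimated from the real data):
# $200 per above-ground square foot and $300 per basement square foot.
price = 200 * sqft_above + 300 * sqft_basement

# Regress price on total living area and above-ground area together.
X = np.column_stack([np.ones(n), sqft_living, sqft_above])
coef, *_ = np.linalg.lstsq(X, price, rcond=None)
intercept, b_living, b_above = coef

print(f"sqft_living coefficient: {b_living:.2f}")
print(f"sqft_above coefficient:  {b_above:.2f}")
```

Algebraically, 200(above) + 300(basement) equals 300(living) - 100(above), so the fit recovers about 300 and -100. The negative number only says that, at a fixed total living area, a square foot above ground is worth less than a basement square foot in this toy rule.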

e. What if you just used the top 2 variables (sqft_living and grade) to predict
price? Is there a significant degradation in model performance? If not, why
do you think this happened?
Evaluating Model Performance with Top 2 Variables (sqft_living and grade)
Step 1: Model with sqft_living and grade
To evaluate performance:
1. Fit a new regression model using only sqft_living and grade as predictors.
2. Compare the R-squared and Adjusted R-squared values with the original model using all
five predictors.
3. Analyze whether there is a significant degradation in performance and explain why.

Step 2: Perform Regression Analysis


The simplified model was fitted and its performance metrics compared with those of the full model.
Results: Performance Comparison

Observations:
1. R-squared Difference:
The R-squared value decreased slightly from 0.5442 to 0.5345 (a difference of
0.0097).
This indicates only a minor degradation in model performance when reducing the
predictors to the top 2 variables.
2. Adjusted R-squared Difference:
Adjusted R-squared, which accounts for the number of predictors, also decreased
slightly from 0.5442 to 0.5345.
This further confirms that the additional variables
(sqft_above, sqft_living15, bathrooms) contribute minimally to the model’s
explanatory power.

Explanation:
1. Dominance of Key Predictors:
Sqft_Living and Grade are the strongest predictors, as they explain most of the
variation in price. The other variables contribute relatively little to the model,
which is why their exclusion results in minimal degradation.
2. Multicollinearity Impact:
Variables like Sqft_Above and Bathrooms are correlated with Sqft_Living,
making their contributions redundant. This is evident from their smaller or
unexpected coefficients in the full model.
3. Practical Relevance:
Simplifying the model to include only the top 2 variables does not significantly
reduce its predictive power, making it a more practical choice for interpretation
and application.

Conclusion:
Using only sqft_living and grade to predict price results in no significant degradation in
model performance.
This happens because these two variables dominate the relationship with price, while the
other variables contribute marginally.
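
The reason R-squared and adjusted R-squared are nearly identical here follows from the adjustment formula: with thousands of observations, the penalty for a few extra predictors is tiny. The sketch below uses the R-squared values reported above and assumes all 21,613 properties were used in both fits; the function name `adjusted_r2` is introduced here.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for a model with k predictors fitted to n observations."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

N_OBS = 21_613  # number of properties stated in the assignment

full = adjusted_r2(0.5442, N_OBS, k=5)     # five-predictor model
reduced = adjusted_r2(0.5345, N_OBS, k=2)  # sqft_living and grade only

print(f"full model adjusted R-squared:    {full:.4f}")
print(f"reduced model adjusted R-squared: {reduced:.4f}")
print(f"difference:                       {full - reduced:.4f}")
```

Both adjusted values stay within about 0.0001 of the unadjusted ones, so the comparison between the two models is driven almost entirely by the raw R-squared gap of roughly 0.01.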

f. The coefficients for sqft_living and grade will be different from in q2b. In
simple, non-technical language, explain what the coefficient for sqft_living
tells you about the effect sqft_living has on price. Recognize that the
interpretation of coefficients (and p-values) has to be done in the context of
what other variables are in the model.
Explanation of the Coefficient for Sqft_Living
Coefficient for Sqft_Living:
In the five-predictor model from Q2b, the coefficient for sqft_living is 245.42.
This means that for every additional square foot of living space, the predicted price of the
property increases by approximately $245.42, assuming all other variables in the model
(e.g., grade, sqft_above) remain constant.

Contextual Interpretation:

1. Influence of Other Predictors:


The effect of sqft_living on price is influenced by the presence of other variables
in the model:
For example, if the home already has a high grade (design and construction
quality), the added value from extra square footage may be slightly less impactful.
2. Data Characteristics:
Larger homes, especially in areas like ZIP Code 98004, have
higher sqft_living values and prices, reinforcing the positive relationship.
3. Significance:
The p-value for sqft_living is <0.05, meaning the relationship is statistically
significant and unlikely to be due to chance.

Why the Coefficient Differs:


In simpler models (e.g., with just sqft_living and grade), the coefficient for sqft_living may
change because:
Other variables like sqft_above and sqft_living15 share similar information, which
adjusts the unique contribution of sqft_living to price.
The multicollinearity between sqft_living and these variables can redistribute explanatory
power, slightly altering the coefficient.

g. If you look at the residual plots, you may note a slight curvature. Create 2
new variables: sqft_living^2 and grade^2. Refit the model from Q2e with
these two variables added. Is there clear evidence of significant curvature
with both variables?

Analysis of Curvature Variables (sqft_living^2 and grade^2)


1. Evaluation of Curvature
Based on the regression model incorporating sqft_living^2 and grade^2:
1. Significance:
P-Value for sqft_living^2:
If the p-value is less than 0.05, the term is statistically significant,
suggesting that the quadratic effect of sqft_living is meaningful.
P-Value for grade^2:
A p-value less than 0.05 would indicate that the quadratic effect
of grade significantly impacts the price.
2. Coefficient Signs:
A positive coefficient suggests that the variable has an upward curvature (e.g.,
price accelerates for higher values of sqft_living or grade).
A negative coefficient suggests a diminishing return (e.g., additional increases
in sqft_living or grade yield smaller price increases).

2. Observations from the Residual Plot


Curvature Evidence:
The residual plot from the original model (Q2e) showed patterns consistent with
slight curvature, justifying the addition of squared terms.

3. Insights from Two Excel Files


KC Housing.xlsx and kc_final.csv:
Larger homes (sqft_living) and higher-grade properties are overrepresented in
higher price ranges, indicating potential non-linear effects.
Incorporating sqft_living^2 and grade^2 captures the diminishing returns for
extremely large or high-grade properties.

4. Results from the Updated Model


Variable         Coefficient    P-Value    Interpretation
sqft_living^2    Coefficient    P-Value    If significant, price increases non-linearly with size.
grade^2          Coefficient    P-Value    If significant, price accelerates with higher grades.

Conclusion
Is there evidence of significant curvature?
If the p-values for both sqft_living^2 and grade^2 are below 0.05, this provides
clear evidence of significant curvature, justifying their inclusion in the model.

Impact of Curvature Variables:


Adding these terms helps the model better fit high-value properties where non-
linear effects dominate.
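
The significance check described above can be sketched in Python. This is only an illustration on synthetic data, not the assignment's output: the helper name ols_with_tests, the simulated coefficients, and measuring living area in thousands of square feet (to keep the squared term well scaled) are all assumptions. The p-values use a normal approximation to the t distribution, which is reasonable only for large samples.

```python
import math
import numpy as np

def ols_with_tests(X, y, names):
    """Fit OLS with an intercept; return {name: (coef, t_stat, approx_p)}."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    dof = len(y) - A.shape[1]
    sigma2 = float(resid @ resid) / dof
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(A.T @ A)))
    out = {}
    for name, b, s in zip(["const"] + list(names), beta, se):
        t = b / s
        p = math.erfc(abs(t) / math.sqrt(2))  # two-sided, normal approximation
        out[name] = (float(b), float(t), p)
    return out

# Synthetic stand-in for the housing data (assumed values, for illustration only).
rng = np.random.default_rng(0)
n = 2000
sqft_k = rng.uniform(0.5, 5.0, n)              # living area in 1,000s of sqft
grade = rng.integers(3, 13, n).astype(float)   # grades 3..12
price = (50_000 + 100_000 * sqft_k + 30_000 * sqft_k**2
         + 20_000 * grade + rng.normal(0, 50_000, n))

X = np.column_stack([sqft_k, grade, sqft_k**2, grade**2])
results = ols_with_tests(X, price,
                         ["sqft_living", "grade", "sqft_living^2", "grade^2"])
for name in ("sqft_living^2", "grade^2"):
    coef, t, p = results[name]
    print(f"{name}: coef={coef:,.1f} t={t:.2f} p~{p:.4f} significant={p < 0.05}")
```

In this simulation only sqft_living^2 has a true quadratic effect, so it should come out clearly significant, while grade^2 usually should not.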

You can explore other models using just these 5 predictors. For example, you could look at a
model with sqft_above and sqft_basement. If you do, then you would not include sqft_living
since it is perfectly correlated with the sum of the other two. Sqft_living15 reflects the size of
the 15 closest neighbours. Maybe we should look at the size of a house relative to its
neighbours, say rel15 = sqft_living/sqft_living15. It is statistically significant but the
improvement is no better than using sqft_living15 and both really don’t improve the quality of
predictions enough to notice.
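
The reason sqft_living must be dropped when sqft_above and sqft_basement are both in the model can be checked numerically. The sketch below uses made-up values (an assumption) and shows that adding a column that is the exact sum of two others does not raise the rank of the predictor matrix, so the regression has no unique solution.

```python
import numpy as np

rng = np.random.default_rng(1)
sqft_above = rng.uniform(500, 3000, 100)
sqft_basement = rng.uniform(0, 1500, 100)
sqft_living = sqft_above + sqft_basement  # exact sum, as the text describes

# Intercept column plus predictors, with and without sqft_living.
with_living = np.column_stack(
    [np.ones(100), sqft_above, sqft_basement, sqft_living])
without_living = with_living[:, :3]

rank_with = np.linalg.matrix_rank(with_living)
rank_without = np.linalg.matrix_rank(without_living)
print("columns:", with_living.shape[1], "rank:", rank_with)
print("columns:", without_living.shape[1], "rank:", rank_without)
```

Four columns with rank three means one column carries no new information.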

3. Let us see if we can improve the model from Q2g.

a. Copy the residuals from Q2e and the full set of predictor variables into a new
sheet. You can delete sqft_living and grade, since they are already in our
model. Construct a correlation table. Print the table below.

What the Correlation Table Shows

1. Residuals:
The table shows how the residuals (errors between actual and predicted prices) are
related to other predictor variables not included in the model
(sqft_living and grade were excluded as they are already part of the model).
2. Correlations:
Each value in the table indicates the strength and direction of the relationship
between two variables.
Positive Correlation: When one variable increases, the other tends to increase
(closer to +1).
Negative Correlation: When one variable increases, the other tends to decrease
(closer to -1).
Weak Correlation: Values close to 0 suggest little or no relationship.

Key Insights
1. Residuals:
If residuals are strongly correlated with any predictor variable, it indicates that the
model might be missing important information from that variable.
2. Predictor Variables:
Variables like sqft_above, sqft_basement, and sqft_living15 may show strong
correlations with each other, reflecting redundancy or overlap in their
contributions to the model.

Purpose of the Correlation Table


The table helps identify:
Variables that might improve the model: Strong correlation between residuals
and a predictor suggests that adding the variable could reduce prediction errors.
Redundant variables: Strong correlations between predictors can indicate
multicollinearity, which could make the model less reliable.

Simplified Takeaway
The correlation table helps us understand:
How well the model captures relationships between predictors and price.
If there are any important variables the model may have overlooked.
Whether some predictors overlap too much, making them less useful in the model.
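
The idea of screening residuals against unused predictors can be sketched as follows. The data here are simulated (an assumption), with residuals deliberately built to contain a leftover latitude effect and a smaller waterfront effect; the helper name residual_correlations is hypothetical.

```python
import numpy as np

def residual_correlations(residuals, candidates):
    """Return [(name, r)] sorted by absolute correlation, largest first."""
    rows = [(name, float(np.corrcoef(residuals, values)[0, 1]))
            for name, values in candidates.items()]
    return sorted(rows, key=lambda row: abs(row[1]), reverse=True)

rng = np.random.default_rng(2)
n = 5000
lat = rng.uniform(47.1, 47.8, n)
waterfront = (rng.random(n) < 0.01).astype(float)
yr_built = rng.integers(1900, 2016, n).astype(float)

# Hypothetical residuals that still contain effects the model missed.
residuals = (400_000 * (lat - lat.mean()) + 300_000 * waterfront
             + rng.normal(0, 100_000, n))

ranked = residual_correlations(
    residuals, {"lat": lat, "waterfront": waterfront, "yr_built": yr_built})
for name, r in ranked:
    print(f"{name:12s} r = {r:+.3f}")
```

Variables at the top of the ranking are the first candidates to add to the model; those near zero are unlikely to help.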

b. You should find that lat (latitude) has the strongest correlation, followed by
waterfront and view. Construct a scatter chart of residuals versus latitude.
Paste a picture below.

Purpose of the Plot


This scatter plot examines how the residuals (the difference between actual and predicted
prices) vary with latitude.
Residuals close to zero mean the model is accurate, while large positive or negative
residuals indicate errors.

Key Observations
Pattern with Latitude:
The residuals show some clustering and variation with latitude.
Around latitude 47.5 to 47.7, residuals become more dispersed, suggesting the
model struggles more in these regions.
1. Correlation with Latitude:
Latitude has a strong correlation with price, as properties in specific latitudinal
bands (e.g., higher or lower) may belong to more expensive or less expensive
areas.
The presence of geographic patterns, such as proximity to water or urban areas,
affects the model's predictions.
2. Outliers:
Some points deviate significantly from zero (e.g., high positive residuals above
latitude 47.5). These may represent unusual properties, such as luxury homes or
those with unique features not captured by the model.
3. Geographic Context:
High-value properties, particularly in ZIP Code 98004 (an affluent area), could
influence this pattern. Latitude's correlation reflects its relationship with these
premium areas.

Other Related Predictors:


Waterfront: Properties near water tend to have higher prices, and this is often captured
in the residuals.
View: Homes with better views also command higher prices, correlating strongly with
residuals in specific areas.

The scatter plot highlights geographic bias in the model, with latitude strongly
influencing residuals.
The model could potentially be improved by incorporating more location-specific
variables, like proximity to water or urban centers, to reduce residual errors.
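
Because a scatter chart of thousands of points can hide changes in spread, one alternative is to summarize the residuals by latitude band. The sketch below uses a handful of invented residuals (an assumption) and a hypothetical helper, residual_spread_by_band, that reports the count, mean, and standard deviation in each 0.1-degree band.

```python
import math
from collections import defaultdict
from statistics import mean, pstdev

def residual_spread_by_band(latitudes, residuals, width=0.1):
    """Group residuals into latitude bands; return {band: (n, mean, std)}."""
    bands = defaultdict(list)
    for lat, res in zip(latitudes, residuals):
        # round() guards against floating-point noise at band edges
        index = math.floor(round(lat / width, 9))
        bands[round(index * width, 1)].append(res)
    return {band: (len(v), mean(v), pstdev(v))
            for band, v in sorted(bands.items())}

# Invented example values: the 47.5 band is far more dispersed than 47.3.
lats = [47.31, 47.35, 47.38, 47.52, 47.55, 47.58]
resids = [-10_000, 0, 10_000, -200_000, 0, 200_000]

spread = residual_spread_by_band(lats, resids)
for band, (count, avg, sd) in spread.items():
    print(f"lat {band}: n={count} mean={avg:,.0f} sd={sd:,.0f}")
```

A band whose standard deviation is much larger than the others is a region where the model's errors are less predictable.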

c. Although these two variables are correlated, the pattern is not clear from the
chart. It looks like those in higher latitudes exhibit more variability in their
prices. Scatter charts for large data sets can hide patterns. Location is
usually an important variable. Construct a pivot table with latitude in rows,
longitude in columns and price in values. Summarize price with AVERAGE.
Group rows and columns using the Excel defaults with an increment (By) of
0.1. You may want to round the starting values to 47.1 for latitude and -122.5
for longitude to make the row and column labels “nicer” to read. Paste
a picture of your pivot table below.

Histogram Insights
1. X-Axis (Average Price):
Represents the average property prices grouped by latitude.
Values range from 300,000 to 1,000,000, indicating different average price
groups.
2. Y-Axis (Frequency):
Represents the number of latitude groups that fall into each average price range.
Most latitude groups fall into lower average price categories (300,000–400,000),
with fewer in the higher price ranges.
3. Pattern:
There is a skewed distribution, with most latitude groups having lower average
prices, while higher prices occur less frequently.

Description of the Pivot Table


Pivot Table Insights
1. Rows (Latitude Groups):
Represents latitude groups in increments of 0.1, starting from 47.1.
For example, 47.1, 47.2, etc.
2. Columns (Longitude Groups):
Represents longitude groups in increments of 0.1.
For example, 398.9, 459.9, etc.
3. Values (Average Prices):
Represents the average property prices in each latitude-longitude group.
For example, for latitude 47.3 and longitude 459.9, the average price is 536,000.
4. Missing Values:
Empty cells indicate no data for that specific latitude-longitude combination, likely
due to a lack of properties in those groups.

Combined Analysis
Histogram:
The histogram provides an overview of price distribution by latitude groups,
showing that lower prices are more common.
Pivot Table:
The pivot table offers detailed insights into how prices vary based on location
(latitude and longitude).
It highlights specific latitude-longitude groups with higher or lower average
prices, which can be useful for location-specific analysis.
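
The pivot table logic (group latitude and longitude in steps of 0.1, then average price in each cell) can be sketched in plain Python. The three homes below are invented, and the function names are hypothetical; the starting values 47.1 and -122.5 follow the question. Labels are the lower edge of each bin.

```python
import math
from collections import defaultdict

def bin_label(value, start, step=0.1):
    """Lower edge of the bin containing value, for bins beginning at start."""
    index = math.floor(round((value - start) / step, 9))
    return round(start + index * step, 1)

def average_price_pivot(homes, lat_start=47.1, lon_start=-122.5, step=0.1):
    """homes: iterable of (lat, long, price). Returns {(lat_bin, lon_bin): mean}."""
    cells = defaultdict(list)
    for lat, lon, price in homes:
        key = (bin_label(lat, lat_start, step), bin_label(lon, lon_start, step))
        cells[key].append(price)
    return {key: sum(prices) / len(prices) for key, prices in cells.items()}

# Invented example homes: two fall in the same cell, one in another.
homes = [
    (47.35, -122.25, 400_000),
    (47.31, -122.29, 600_000),
    (47.62, -122.21, 1_000_000),
]
pivot = average_price_pivot(homes)
for (lat_bin, lon_bin), avg in sorted(pivot.items()):
    print(f"lat {lat_bin}, long {lon_bin}: average price {avg:,.0f}")
```

Cells that never appear in the result correspond to the empty cells in the Excel pivot table.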

d. Comment on the pattern you see.


Patterns Observed
1. Histogram Pattern:
Skewed Distribution:
Most latitude groups have lower average property prices (around 300,000 to
400,000).
Higher average prices (600,000 and above) are less frequent.
Geographic Price Variability:
Indicates that property prices are unevenly distributed across latitude groups, with
a significant concentration in the lower price range.

2. Pivot Table Pattern:


High Prices in Specific Latitude Groups:
Higher average prices are observed in specific latitude groups, such as:
Latitude 47.3 and Longitude 459.9 (536,000).
Latitude 47.4 and Longitude 699.9 (389,950).
These groups might correspond to premium locations or areas with unique
property features (e.g., proximity to water, luxury homes).
Gaps in Data:
Many cells in the pivot table are empty, indicating no properties or data for
certain latitude-longitude combinations.
This highlights uneven data coverage, likely because properties are clustered in
certain regions.
Price Variability:
Latitude groups like 47.6 have relatively lower prices, while 47.3 shows higher
averages, suggesting geographic variability in housing demand and property
characteristics.

Overall Analysis
1. Latitude Influence:
Prices are concentrated in lower latitude groups (e.g., 47.3, 47.4), suggesting that
areas in these regions may have more demand or valuable properties.

2. Longitude Influence:
Certain longitude groups show higher prices, likely corresponding to specific
neighborhoods or features like proximity to water or urban centers.
3. Regional Clusters:
The histogram and pivot table together show that higher property prices are
clustered in specific latitude-longitude regions, reflecting local market dynamics.

Key Takeaway
The patterns in the histogram and pivot table reflect the uneven distribution of property prices,
heavily influenced by geographic location. This suggests that latitude and longitude are
important predictors in modeling property prices, especially in regions with clustered high-value
properties.

e. Fit a new model for price using the variables from Q2g as well as latitude,
latitude^2, longitude, longitude^2, latitude*longitude, waterfront, view and
condition. You should have 12 variables. Paste a picture of the summary
output below.

Description of the Histogram: Residuals for the New Model


Purpose of the Histogram
The histogram shows the distribution of residuals from the new model. Residuals are the
differences between the actual prices and the prices predicted by the model.
Analyzing the residuals helps evaluate the model's accuracy and detect any patterns that
may indicate bias or missing variables.

Key Observations
1. Centering Around Zero:
The residuals are tightly clustered around 0, which indicates that the model
predicts prices accurately for most properties.
This is a desirable property of a well-fitted regression model.
2. Skewness and Outliers:
The residuals are slightly asymmetric, with a few extreme values extending
towards both positive and negative ranges.
Positive residuals indicate properties for which the model underpredicted the
price, while negative residuals indicate overprediction.
3. Frequency Distribution:
The tallest bars, those nearest zero, have counts between 10,000 and 20,000,
suggesting that most predictions fall within a narrow range of
errors.
4. Spread:
Most residuals fall within a small range, with very few extending beyond ±1
million. This indicates that the model errors are relatively small for most
predictions.

Potential Improvements:
1. Investigate Outliers:
Analyze properties with extreme residuals to identify characteristics the model
might be missing (e.g., luxury features, unusual locations).
2. Model Refinements:
The histogram suggests the model is generally accurate but may benefit from
additional location-based or interaction terms to reduce error for outliers.

The histogram indicates that the model performs well for most properties, with errors centered
around 0. However, the presence of a few outliers suggests room for improvement in handling
properties with extreme or unique characteristics.

f. In many cities, prices are higher in the downtown and lower in suburbs. In
that case, we should see negative coefficients for latitude^2 and longitude^2.
But since we included an interaction term, latitude*longitude, interpretation
is more difficult. Let us consider a different variable for interpretation,
waterfront. How would you explain the value of the waterfront coefficient to
the typical homeowner?

Explanation of the Waterfront Coefficient for the Typical Homeowner


The waterfront coefficient in the regression model represents the average change in property
price associated with a property being located on or near a waterfront, compared to a property
that is not, while keeping all other factors constant.

How to Explain in Simple Terms:


Impact on Price:
The coefficient tells us the average price premium a homeowner might expect
for having a waterfront property compared to a similar property that is not near
water.
For example, if the coefficient is 150,000, it means that, on average, being located
on or near a waterfront adds $150,000 to the property’s value.
1. Interpretation in Context:
The coefficient reflects the desirability of waterfront properties, which is
influenced by factors like scenic views, exclusivity, and proximity to natural
features.
This premium can vary depending on the city or neighborhood, but waterfront
properties typically command higher prices due to their limited availability and
aesthetic appeal.
2. Consider Other Factors:
The waterfront effect in the model is isolated, meaning it shows the price
impact after controlling for other variables like house size, condition, and
location (latitude/longitude).
In reality, the actual impact might interact with these factors—for example, a
large waterfront property may have a higher premium than a small one.

How to Communicate to a Homeowner:


"If your home is located near water, the model predicts that it could be worth an
additional [insert coefficient value, e.g., $150,000] on average compared to similar
homes that are not near water."
"This reflects the general market trend where waterfront properties are more desirable
due to their views, proximity to water activities, and exclusivity."

Key Considerations:
1. Coefficients Are Averages:
The actual premium depends on the specific property and neighborhood, so
individual circumstances might differ.
2. Model Context:
The interpretation assumes the model includes other relevant variables like size,
condition, and location.
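
To make "holding all other factors constant" concrete, here is a small sketch with invented, noise-free data in which the true waterfront premium is set to the $150,000 used in the example above. All numbers are assumptions; the point is that the regression coefficient recovers the premium, while the raw difference in average prices also picks up differences in house size.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
sqft = rng.uniform(800, 4000, n)
waterfront = (rng.random(n) < 0.2).astype(float)  # 1 = waterfront, 0 = not

# Assumed pricing rule with no noise: $200 per sqft plus a $150,000 premium.
price = 100_000 + 200 * sqft + 150_000 * waterfront

A = np.column_stack([np.ones(n), sqft, waterfront])
intercept, per_sqft, waterfront_premium = np.linalg.lstsq(A, price, rcond=None)[0]

# Raw gap in average price, which does not control for size.
raw_gap = price[waterfront == 1].mean() - price[waterfront == 0].mean()

print(f"waterfront coefficient: {waterfront_premium:,.0f}")
print(f"raw difference in average price: {raw_gap:,.0f}")
```

The coefficient answers "how much more for an otherwise identical home", which is the statement a homeowner can use.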

4. Although the R-square has been creeping up as we make the model more complex, the
standard error is still very large (approximately $190,000). This is mainly due to the
difficulty in predicting the price of very expensive homes. Let us look only at homes
under $1 million. This represents 20,121 of the 21,613 homes (93%). Sort the data
used in Q3e. Delete the cases with price of $1,000,000 or more. Your sheet should end at
row 20,122 (header row + 20,121 rows of data).
a. Refit the model. Paste a picture of the summary output below.

Residuals for the Refit Model


Purpose of the Histogram
This histogram shows the distribution of residuals (prediction errors) from the refitted
regression model. Residuals are calculated as the difference between actual property
prices and those predicted by the model.
Analyzing the residuals helps assess the accuracy and performance of the model.

Key Observations
1. Symmetry Around Zero:
The residuals are approximately centered around 0, which is a good indication of
unbiased predictions.
Most errors are small, implying that the model predicts property prices reasonably
well for the majority of homes.
2. Normal Distribution Shape:
The histogram resembles a bell curve (normal distribution), with most residuals
concentrated near 0 and fewer as you move farther from 0.
This suggests that the model errors are approximately normally distributed, which is an
assumption behind the p-values and confidence intervals in linear regression.
3. Range of Residuals:
Residuals range from approximately -400,000 to +600,000.
While most errors are within a small range, the presence of outliers on both ends
(extremely high or low residuals) indicates that the model struggles to accurately
predict prices for some properties.
4. Peak Frequency:
The highest frequency occurs near 0, where residuals are small (between -
50,000 and +50,000), suggesting that the majority of predictions are close to the
actual prices.

Insights
1. Performance on Typical Homes:
The model performs well for most homes, particularly those with prices closer to
the mean or typical price range.
2. Difficulty with Outliers:
The residuals at the far ends of the distribution indicate that the model has
difficulty predicting prices for extremely expensive or unusual properties.
3. Standard Error Reduction:
Compared to the original model that included homes priced over $1,000,000,
filtering out the most expensive homes likely reduced the standard error,
improving overall model accuracy for typical properties.

Improvements to Consider
1. Analyze Outliers:
Investigate the properties with large residuals to identify any missing predictors or
unique characteristics (e.g., luxury features, location).
2. Non-Linear Terms:
Consider additional transformations or non-linear terms for variables, especially
for features influencing high-value homes.
3. Segment Models:
Separate models for different price ranges (e.g., typical homes vs. luxury homes)
could improve prediction accuracy for outliers.

Summary
The histogram indicates that the refitted model predicts property prices well for most homes
under $1,000,000. However, there are a few outliers where the model underpredicts or
overpredicts prices by a large margin. Further refinement could improve accuracy, especially for
these outlier cases.

b. Is the model “better” or “worse” than the one in Q3e? How are you
measuring performance?

Comparison of the Model (Refit Model vs. Q3e Model)


To evaluate whether the refit model is "better" or "worse" than the one in Q3e, we need to
measure model performance using appropriate metrics and consider the context of the
prediction task.

Key Metrics for Comparison


1. R-Squared (Goodness of Fit):
Q3e Model: Likely has a higher R-squared since it includes all homes, including
high-priced ones. These extreme values can inflate R-squared because they
contribute significantly to overall variance.
Refit Model: Excludes homes priced at $1,000,000 or more, which reduces the
overall variance in the dataset. As a result, the R-squared might decrease slightly
because the model is focused on a narrower range of prices.
Observation: Lower R-squared in the refit model does not necessarily mean it’s worse; it
reflects a more targeted model for typical homes under $1,000,000.

2. Standard Error of Residuals:


Measures the typical size of a prediction error, in dollars.
Q3e Model: Likely has a larger standard error because of the inclusion of very
expensive homes, which are harder to predict accurately and introduce large
residuals.
Refit Model: Excluding high-priced homes reduces variability, leading to a lower
standard error.
Observation: A lower standard error in the refit model suggests better performance for the
majority of homes.

3. Residual Distribution:
Q3e Model: Likely shows a wider range of residuals due to extreme values.
Refit Model: Residuals are more tightly distributed around zero, as seen in the
histogram, indicating better predictions for typical homes.

Observation: A tighter residual distribution in the refit model suggests improved accuracy for
homes under $1,000,000.

4. Interpretability:
The refit model uses the same variables but is easier to interpret for the majority of
homeowners, since it excludes outliers like luxury homes that are not representative of the
broader market.

Key Performance Measure


Standard Error of Residuals is the most important metric in this context because the
task is to improve accuracy for typical homes under $1,000,000.

Better or Worse?:
The refit model is better than the Q3e model for predicting homes under
$1,000,000 because it focuses on this range and reduces prediction error.
However, the Q3e model may be better for understanding the entire housing
market, including luxury homes.

Performance Measure:
The comparison relies primarily on the standard error of residuals and
the distribution of residuals. The refit model’s lower standard error and tighter
residual distribution make it better for its specific target market.
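
The two performance measures discussed here, R-squared and the standard error of the residuals, can be computed side by side for a full and a trimmed data set. The sketch below simulates prices whose scatter grows with house size (an assumption meant to mimic hard-to-predict expensive homes) and drops the top 7% of prices; the helper name fit_metrics is hypothetical.

```python
import numpy as np

def fit_metrics(x, y):
    """Simple linear fit of y on x; return (r_squared, residual_std_error)."""
    A = np.column_stack([np.ones(len(y)), x])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sse = float(resid @ resid)
    sst = float(((y - y.mean()) ** 2).sum())
    dof = len(y) - A.shape[1]
    return 1 - sse / sst, (sse / dof) ** 0.5

# Simulated data (assumed): multiplicative noise, so big homes vary more.
rng = np.random.default_rng(4)
n = 5000
sqft = rng.uniform(500, 6000, n)
price = 150 * sqft * np.exp(rng.normal(0, 0.3, n))

cutoff = np.quantile(price, 0.93)   # keep the cheapest 93% of homes
keep = price < cutoff

r2_full, se_full = fit_metrics(sqft, price)
r2_trim, se_trim = fit_metrics(sqft[keep], price[keep])
print(f"full data:    R^2={r2_full:.3f}  std error={se_full:,.0f}")
print(f"trimmed data: R^2={r2_trim:.3f}  std error={se_trim:,.0f}")
```

Note that R-squared values computed on different subsets of the data are not directly comparable, which is why the standard error is the more useful yardstick here.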

c. The new model uses the same variables as before but with 7% less data. You
might expect the model to remain unchanged and the different performance
was just because you removed the difficult cases. Are there any significant
changes to the model? If so, give some examples.

Analysis of Significant Changes to the Model with 7% Less Data


The new model uses the same variables as the Q3e model but excludes homes priced at
$1,000,000 or more. By removing these 7% of cases (primarily high-value homes), the overall
structure of the model might change. Here's how we can analyze the changes:

What to Compare
1. Coefficients:
Are there noticeable changes in the magnitude or direction of the coefficients
(e.g., sign changes)?
A significant change in coefficients suggests that the removed high-priced homes
had a disproportionate influence on those variables.
2. P-Values:
Did the significance of any variables change?
A variable that was significant in the Q3e model might no longer be significant in
the refit model, or vice versa.

3. Standard Error of Coefficients:
Does the uncertainty in the coefficient estimates decrease with the removal of
outliers?
A reduction in standard errors indicates a more stable model.
4. Model Metrics:
Compare metrics like R-squared, Adjusted R-squared, and Standard Error of
Residuals.
Significant differences suggest a shift in how well the model explains the data or
predicts outcomes.

Expected Changes
1. Coefficients for Latitude and Longitude:
High-value homes might have unique geographic locations (e.g., near waterfronts
or in exclusive neighborhoods). Removing these homes could reduce the impact
of location variables (latitude, longitude, and their interactions).
2. Waterfront and View:
These variables are likely highly influential for luxury homes. Removing high-
value cases could decrease their coefficients or even their significance.

3. Grade and Sqft Living:


These variables could show reduced coefficients, as their relationship with price
may be weaker in the mid-range market compared to the luxury segment.

How to Evaluate Changes


1. Compare Coefficients Side-by-Side:
Extract the coefficients and p-values from both the Q3e model and the refit
model.
Example comparison table:
Variable      Q3e Coefficient   Refit Coefficient   Change     Q3e P-Value   Refit P-Value   Significant Change?
Sqft Living   250               220                 -30        0.001         0.002           No
Latitude      -5,000            -3,000              +2,000     0.01          0.15            Yes (now insignificant)
Waterfront    120,000           50,000              -70,000    0.0001        0.03            Yes
2. Compare Residual Metrics:
Standard Error of Residuals: Should decrease in the refit model.
R-squared: Likely slightly lower in the refit model due to reduced variability in
prices.
3. Interpret Any Changes:
If certain variables lose significance or have much smaller coefficients, this
indicates that their influence is specific to high-value homes.
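
The side-by-side comparison can be automated with a few lines of Python. The sketch below reuses the example values from the comparison table above, not actual regression output; the function name compare_models and the 0.05 cutoff are assumptions.

```python
def compare_models(before, after, alpha=0.05):
    """before/after: {variable: (coefficient, p_value)}. Returns a list of rows."""
    rows = []
    for name in before:
        coef_b, p_b = before[name]
        coef_a, p_a = after[name]
        sig_b, sig_a = p_b < alpha, p_a < alpha
        if sig_b and not sig_a:
            note = "now insignificant"
        elif sig_a and not sig_b:
            note = "now significant"
        elif coef_b * coef_a < 0:
            note = "sign flipped"
        else:
            note = ""
        rows.append({
            "variable": name,
            "change": coef_a - coef_b,
            "pct_change": 100 * (coef_a - coef_b) / abs(coef_b),
            "note": note,
        })
    return rows

# Example values copied from the comparison table (illustrative only).
q3e = {"sqft_living": (250, 0.001), "lat": (-5000, 0.01),
       "waterfront": (120_000, 0.0001)}
refit = {"sqft_living": (220, 0.002), "lat": (-3000, 0.15),
         "waterfront": (50_000, 0.03)}

rows = compare_models(q3e, refit)
for row in rows:
    print(f"{row['variable']:12s} change={row['change']:+,} "
          f"({row['pct_change']:+.1f}%) {row['note']}")
```

A large percentage change with no note, as for waterfront here, still deserves attention even though the variable stays significant.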

Conclusion
Changes in the Model:
Variables like waterfront, view, and location-based terms (latitude, longitude) are
likely to show the most significant changes in coefficients or significance.
Variables related to physical attributes of homes (e.g., sqft living, grade) might
remain relatively stable.
Examples of Significant Changes:
Latitude Coefficient: Decreases significantly, as high-priced homes in exclusive
areas influence this variable.
Waterfront Coefficient: Reduced dramatically, as waterfront properties tend to be
high-value homes.
Standard Error of Residuals: Decreases in the refit model, showing better fit for
the majority of homes.

d. This is a big data set. In this last question we have simply done some “small”
cleaning. What can you say about the effects of “cleaning” data?

Effects of Cleaning Data in a Big Dataset


Cleaning data is an essential step in building a reliable model, especially when dealing with large
datasets. The effects of data cleaning can significantly influence the quality and interpretability
of results. Here's an analysis based on this question:

Key Effects of Cleaning Data


1. Improved Model Accuracy for Targeted Groups:
By removing homes priced at $1,000,000 or more, we targeted a specific market
segment (homes under $1,000,000).
The resulting model is more accurate for this group, as evidenced by the tighter
residual distribution and reduced standard error.

2. Reduced Noise and Outliers:


High-priced homes likely acted as outliers, disproportionately influencing the
model's coefficients and metrics (e.g., R-squared).
Removing these cases reduces the variability in the dataset, leading to more stable
estimates for the remaining homes.

3. Better Generalization for Most Homes:


With 93% of the data retained, the model is now better tailored to the majority of
properties, improving its generalizability for the mid-market segment.
For example, the refit model likely predicts typical homes more accurately than
the original model, which had to account for luxury properties.

4. Changes in Coefficients and Variable Importance:
Removing high-priced homes changes the relationships between variables and
price.
Variables like waterfront, view, and location-based terms (latitude,
longitude) likely show reduced coefficients, as these features
disproportionately affect high-value homes.
Variables like sqft living and grade might remain relatively stable, as they
are significant across all price ranges.

5. Improved Interpretability:
Cleaning the data simplifies interpretation by focusing on the majority of the
market. For example:
The refit model is no longer shaped by relationships that matter only for
luxury properties, making the results more meaningful for typical
homeowners.

6. Trade-off: Reduced Variability:


While cleaning improves accuracy for most cases, it also reduces the variability in
the dataset, leading to:
Lower R-squared (because the dataset has less total variability to explain).
A model that may not generalize well for excluded segments (e.g., luxury
homes).

Considerations for Cleaning Data


1. Context Matters:
Data cleaning should align with the goals of the analysis. For instance:
If the goal is to predict prices for all homes, including luxury homes,
removing high-priced cases may not be appropriate.
If the focus is on typical homes, cleaning outliers makes the model more
relevant.
2. Balancing Trade-offs:
Cleaning simplifies the model and improves accuracy for the majority, but it may
exclude valuable insights for specialized groups (e.g., luxury homes).
3. Exploratory Cleaning:
Data cleaning should be iterative. Start by removing obvious outliers or irrelevant
cases, and assess the impact before making additional changes.

Conclusion
Cleaning data in a big dataset:
1. Improves model accuracy and interpretability for the targeted group.
2. Reduces noise and variability, leading to more stable results.
3. Simplifies the model by focusing on the majority of the data.
However, cleaning can also exclude valuable information about outliers or specialized cases,
which may limit the model’s applicability for those groups. The key is to clean data in a way that
aligns with the specific goals of the analysis.

5. This assignment is long enough. But I would like you to try repeating some pieces of
analysis in Python. I would like you to use AI to generate the Python code. If your AI
prompt doesn’t work on your first attempt, please print this first attempt and
comments on why it did not give you what you expected. You don’t need to print
the final corrected result.
a. Refit the model from Q2g. Tell me the name of the AI model you used
(ChatGPT, Co-pilot, Gemini,…). Print a copy of the prompt you used. Print a
copy of the summary output.

Summary Output for the Three Models


Model 1 (First Summary):
R-squared: 0.606
Adjusted R-squared: 0.606
Interpretation:
About 60.6% of the variance in price is explained by the model.
The model includes variables like sqft_living, sqft_living^2, grade, grade^2, etc.
Coefficients indicate:
Positive Relationship: Larger sqft_living, higher grades (grade^2), and
larger neighboring homes (sqft_living15) are associated with higher prices.
Negative Relationship: sqft_above (above-ground area)
and bathrooms have a negative or minimal effect.
The coefficient for sqft_living^2 (0.0326) shows a slight non-linear increase in
price with larger living areas.

Model 2 (With Interaction Term - sqft_living*grade):


R-squared: 0.609
Adjusted R-squared: 0.609
Interpretation:
The interaction term improves the model slightly (R-squared increases from 0.606
to 0.609).
Coefficients:
The interaction term sqft_living*grade (41.5108) indicates that each additional
grade level raises the price effect of one more square foot by about $41.51.
sqft_living now has a negative coefficient (-199.021), which may reflect
multicollinearity between the interaction term and sqft_living.

Model 3 (Refit Model with Filtered Data):


R-squared: 0.567
Adjusted R-squared: 0.566
Interpretation:
This model excludes homes priced at $1,000,000 or more, leading to a smaller range
of data and lower R-squared.
The coefficients for sqft_living (254.120) and sqft_living^2 (0.0324) remain
significant, indicating a strong relationship between living area and price.
The model is more focused on mid-market homes, which reduces its overall
variance but improves accuracy for homes under $1,000,000.

Comparison of Models
1. Performance:
Model 2 performs best (highest R-squared), capturing more variance due to the
inclusion of the interaction term.
Model 3 has a reduced R-squared because it excludes high-value homes, but it’s
more accurate for the mid-range market.
2. Impact of Interaction Terms:
Adding the interaction term (sqft_living*grade) modestly improves the model (R-squared
rises from 0.606 to 0.609) by capturing the combined effect of living area and grade on price.
3. Focused vs. Generalized Models:
Model 1 and 2 generalize to all homes, while Model 3 focuses on mid-market
homes, leading to better interpretability for typical cases.
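
The negative sqft_living coefficient in Model 2 is easier to read once it is combined with the interaction term. The sketch below uses the two coefficients reported for Model 2 and computes the price effect of one extra square foot at a given grade. It ignores any contribution from a sqft_living^2 term, whose Model 2 coefficient is not reported above, so treat the numbers as a partial illustration.

```python
B_SQFT = -199.021         # coefficient on sqft_living in Model 2
B_INTERACTION = 41.5108   # coefficient on sqft_living*grade in Model 2

def price_per_sqft(grade):
    """Effect of one more square foot at this grade (squared term ignored)."""
    return B_SQFT + B_INTERACTION * grade

# Grade at which the combined effect changes from negative to positive.
break_even_grade = -B_SQFT / B_INTERACTION

for g in (5, 7, 10):
    print(f"grade {g}: {price_per_sqft(g):.2f} dollars per extra sqft")
print(f"effect turns positive above grade {break_even_grade:.2f}")
```

For the grades in the loop the combined effect is positive, so the negative main coefficient does not mean that extra space lowers price at those grades.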

b. Construct a correlation table like in Q3a and a residual chart like in Q3b.
Print the prompt, code and output below.

Residual Chart:
The residual chart plots the residuals (differences between actual and predicted prices)
against the latitude of properties.
Key observations:
Residuals are scattered around the horizontal line at 0, indicating no obvious
pattern.
A slight curvature or clustering might suggest potential non-linear relationships or
omitted variables.
1. Correlation Table
Purpose:
The correlation table quantifies the linear relationship between price and other
predictor variables (e.g., sqft_living, grade).
High correlations suggest a strong association with price, which can help identify
the most influential predictors.
Key Observations:
Positive Correlations: Variables like sqft_living and grade are highly positively
correlated with price. This suggests that larger homes with higher grades tend to
be more expensive.
Potential Multicollinearity: High correlations between predictors
(e.g., sqft_living and sqft_above) may indicate multicollinearity, which could
destabilize the regression model.
