KC Housing Assignment Fall 2024
Assignment #1
Due: Tuesday, December 2nd at 11:59pm
The term is almost over so there will only be one “assignment”. The purpose is primarily to get
you to practice use of tools and understanding results. In this assignment, I also want to have
you experience the wandering pathway you may follow in building a predictive model. The data
we will use is a popular dataset that you can find on Kaggle. You can also find numerous data
analyses and models for this dataset on Kaggle and elsewhere.
https://www.kaggle.com/datasets/astronautelvis/kc-house-data
The data set contains information about 20 characteristics of 21,613 properties. The data is from
King County in Washington. You may find it useful to actually look at the area on a map. The
data set has the latitude and longitude as well as zip code for each property. For example, I
googled the zip code for some of the most expensive properties and got this map.
https://www.unitedstateszipcodes.org/98004/
In the assignment, I have asked you to do a variety of tasks using Excel. Later I have asked you
to repeat some of this work using Python. Having done it first in Excel will give you a sense of
what the Python output should look like (so you will know if Python did what you thought it
should do). For the Python, I ask that you use a prompt to AI (Co-pilot, ChatGPT, or any other
LLM that you choose). I have asked to see your prompt, the Python code and the output. If
your attempt yields different results from Excel, I ask that you submit your original
prompt and commentary on why the results are not what you expected. I want to see your
learning process. In the future you will use AI to write code and the challenge will be knowing
when the outcome is what you thought you asked for. Don’t give me your corrected version – I
am sure you will figure it out.
With Excel, I recommend that for each question, you copy the necessary data into
a new worksheet and label the sheet with the question number. Put the output of
any analyses on the same page. Do not do your work directly on the data page,
since you risk accidentally corrupting the data.
Print/paste your answers directly into this Word file. It makes it easier for me to grade
when I can see the question you are responding to.
Page 1 of 40
MBAN 5520 Statistics and Predictive Analytics Fall 2024
b. Construct a correlations table that includes price and all predictor variables.
Exclude ID and date. Date is non-numeric and Excel will complain. Paste a
picture below.
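As a rough illustration of what the Python version of this correlation table might look like, the sketch below uses NumPy on a small synthetic sample. The column names mirror the dataset, but the numbers are generated for illustration only and are not the KC housing data.

```python
import numpy as np

# Synthetic stand-in for a few numeric columns (NOT the real KC data).
rng = np.random.default_rng(0)
n = 500
sqft_living = rng.uniform(800, 4000, n)
grade = np.clip(np.round(sqft_living / 500 + rng.normal(0, 1, n)), 1, 13)
price = 150 * sqft_living + 40_000 * grade + rng.normal(0, 80_000, n)

names = ["price", "sqft_living", "grade"]
data = np.column_stack([price, sqft_living, grade])

# rowvar=False treats each column as one variable and each row as one property.
corr = np.corrcoef(data, rowvar=False)

for name, row in zip(names, corr):
    print(f"{name:12s}", "  ".join(f"{v:6.3f}" for v in row))
```

The diagonal is always 1 and the matrix is symmetric, which is a quick sanity check to apply to the Excel table as well.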
ZIP Code 98004 has a median home value of $2,240,226, which reflects a strong
influence of factors such as home size (sqft_living) and grade (quality of
construction and design).
High-value properties in this ZIP code likely contribute to the strong correlations
observed for variables like sqft_living and grade.
The higher frequency of mid-range sqft_living values reflects the inclusion of properties
from less affluent areas.
Overall Observations:
The Python histogram effectively visualizes the distribution of sqft_living, with results
matching the data from both original Excel files and the combined Excel file.
The histogram emphasizes mid-range properties (1,000–3,000 sqft), with the tail
confirming the influence of high-end properties like those in ZIP Code 98004.
Comparison of the Python Code Output, Generated Excel File, Original Excel Files, and
ZIP Code Data
1. Python Scatter Plot Observations:
The scatter plot shows a clear positive relationship between sqft_living (home size)
and price.
Smaller homes (1,000–2,000 sqft) typically cluster in the lower price range
($200,000–$500,000).
Larger homes (above 4,000 sqft) extend into the higher price ranges (above
$1,000,000), with a few extreme outliers.
2. Generated Excel File (price_vs_sqft_living_data.xlsx):
This file combines the price and sqft_living columns from both original datasets.
Manual inspection shows that data points are consistent with the scatter plot, confirming
the observed trends.
Larger values of sqft_living correspond to higher prices, reflecting the same relationship
seen in the scatter plot.
3. Comparison with Original Excel Files:
KC Housing.xlsx:
Contains a wide range of properties, with many homes in the lower price range
and sqft_living values between 1,000 and 3,000 sqft.
Aligns with the scatter plot's clustering of smaller properties in the lower-left
region.
kc_final.csv:
Includes additional high-value homes, contributing to the scatter plot’s upper-
right tail, where prices exceed $2,000,000 and sqft_living is over 4,000 sqft.
4. Insights from ZIP Code 98004:
Homes in ZIP Code 98004 are generally larger and more expensive, as reflected in the
plot’s upper-right corner.
Median Home Value: $2,240,226 suggests that many properties in this ZIP code
contribute to the outliers in the scatter plot.
Size of Homes: Many homes in this area have sqft_living values exceeding 4,000
sqft, which aligns with the high sqft_living and price points in the scatter plot.
Combined Analysis:
The scatter plot and the generated Excel file confirm that:
1. Smaller homes dominate the dataset and cluster in the lower price range,
consistent across all sources.
2. Larger, high-value properties (e.g., in ZIP Code 98004) are outliers and contribute
to the plot's tail.
Insights from ZIP Code 98004 help explain the presence of high-priced, large homes in
the data.
Combined Analysis:
The Python histogram, generated Excel file, and insights from ZIP Code 98004
collectively confirm the challenges in creating a robust predictive model:
1. Skewed price distribution.
2. Multicollinearity among predictors.
3. Lack of variables capturing location-based value differences.
2. Let us build a model to predict price using the 5 variables with the strongest correlation
with price (sqft_living, grade, sqft_above, sqft_living15, bathrooms). Ask for residuals
and residual plots with your regression(s).
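A minimal sketch of a five-predictor regression with residuals is shown below. It assumes NumPy only and uses synthetic predictors in place of the five named columns; the "true" coefficients are invented so that the fit can be checked against them.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
# Synthetic predictors standing in for the five named columns.
X = rng.normal(size=(n, 5))
true_beta = np.array([3.0, 2.0, -1.0, 0.5, -0.5])   # illustrative values only
y = 10.0 + X @ true_beta + rng.normal(scale=1.0, size=n)

# Add a column of ones so the model has an intercept, then solve least squares.
A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

fitted = A @ coef
residuals = y - fitted          # actual minus predicted

print("intercept:", round(coef[0], 3))
print("slopes:", np.round(coef[1:], 3))
print("mean residual:", round(residuals.mean(), 6))
```

With an intercept in the model the residuals average to zero, so a residual column in Excel or Python that does not average to roughly zero is a sign that something went wrong.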
2. Regression Summary
The regression model uses five variables (sqft_living, grade, sqft_above, sqft_living15,
and bathrooms) to predict price. Key insights:
Model Fit:
R-squared: 0.544 (54.4% of price variance is explained by the model).
Adjusted R-squared: Also 0.544, indicating a consistent fit.
Coefficients:
sqft_living: A coefficient of 245.42, meaning each additional square foot
is associated with a price about $245 higher, holding the other predictors constant.
grade: A coefficient of 111,000. Higher grades (better design and quality)
significantly increase price.
sqft_above: A negative coefficient (-80.48), likely due to multicollinearity
with sqft_living.
sqft_living15: A coefficient of 22.82. The size of neighboring homes has a
smaller positive influence.
bathrooms: A surprising negative coefficient (-35,460), which may indicate
interaction effects or multicollinearity.
Intercept:
The intercept of -646,900 reflects the price baseline when all predictors are zero
(not practically meaningful but required for the model).
Significance:
All predictors are statistically significant (p-value < 0.05), meaning they
contribute meaningfully to the model.
Diagnostics:
Omnibus Test and Jarque-Bera (JB): Indicate non-normality in residuals
(likely due to outliers and skewness in the data).
Condition Number: 29,500, suggesting potential multicollinearity.
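To make the coefficients concrete, the sketch below plugs the rounded values reported above into the regression equation for one hypothetical house. The house's characteristics are invented for illustration and do not come from the dataset.

```python
# Coefficients as reported (rounded) in the regression summary above.
coefficients = {
    "intercept": -646_900,
    "sqft_living": 245.42,
    "grade": 111_000,
    "sqft_above": -80.48,
    "sqft_living15": 22.82,
    "bathrooms": -35_460,
}

def predict_price(house):
    """Return the intercept plus the sum of coefficient * value over the predictors."""
    total = coefficients["intercept"]
    for name, value in house.items():
        total += coefficients[name] * value
    return total

# Hypothetical house, chosen only for illustration.
house = {"sqft_living": 2000, "grade": 7, "sqft_above": 2000,
         "sqft_living15": 2000, "bathrooms": 2}
bigger = dict(house, sqft_living=2100)   # same house with 100 more sqft_living

base = predict_price(house)
print(f"predicted price: ${base:,.0f}")
print(f"effect of +100 sqft_living: ${predict_price(bigger) - base:,.0f}")
```

For this hypothetical house the prediction works out to about $434,700, and adding 100 square feet of living space while holding everything else fixed (including sqft_above) adds 100 × 245.42 = $24,542.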
Combined Analysis
1. Strengths:
The model captures important predictors like sqft_living and grade that strongly
influence price.
Explains a substantial proportion of variance in price (54.4%).
2. Weaknesses:
Presence of heteroscedasticity and outliers affects prediction accuracy, especially
for high-value properties.
Multicollinearity among variables (e.g., sqft_living and sqft_above) may distort
coefficient interpretations.
3. Insights from ZIP Code 98004:
High-value properties with large square footage and high grades likely contribute
to outliers and residual patterns in the model.
The Excel file contains the coefficients derived from the regression model:
Key Observations:
Sqft_Living and Grade have the strongest positive influence on price, as also
reflected in the dataset where larger, high-quality homes command higher prices.
Sqft_Above has a negative coefficient, likely due to multicollinearity
with Sqft_Living.
Bathrooms has a surprising negative coefficient, suggesting it interacts with other
variables.
Combined Analysis:
1. Excel File (Coefficients):
Highlights the key predictors (Sqft_Living and Grade) that align with patterns in
both datasets and ZIP Code 98004 properties.
Suggests challenges in accurately predicting prices for high-value properties,
reflected in the residual histogram.
2. Histogram:
Captures the difficulty in modeling skewed distributions, particularly for high-
priced homes in affluent areas like ZIP Code 98004.
3. Original Datasets:
Confirm the dominance of mid-range properties and provide context for outliers.
4. ZIP Code Data:
c. Are there any variables that do not appear to be useful in the model? Justify
your answer.
Analysis of Variables in the Model
To determine if any variables in the model are not useful for predicting price, we evaluate:
1. Statistical Significance:
Variables with high p-values (greater than 0.05) are considered statistically
insignificant and may not contribute meaningfully to the model.
2. Impact on the Model:
Variables with very small coefficients or unexpected coefficients (e.g., negative
relationships for typically positive predictors) may not be practically meaningful.
Justification of Findings
Multicollinearity:
Multicollinearity between Sqft_Living, Sqft_Above, and Bathrooms likely distorts the
coefficients, causing unexpected negative relationships.
Sqft_Living is already a strong predictor, making Sqft_Above redundant.
Practical Interpretation:
Bathrooms and Sqft_Living15 have weak or unexpected coefficients, reducing their
practical usefulness despite being statistically significant.
Conclusion:
1. Key Predictors: Sqft_Living and Grade are the strongest predictors in the model.
2. Potentially Less Useful Variables: Bathrooms, Sqft_Above, and Sqft_Living15 may not
add significant value due to multicollinearity and weaker practical impact.
in specific categories (e.g., smaller homes where extra bathrooms may not
add value).
Interpretation: The unexpected negative relationship suggests that the
variable's influence is not direct and might be better understood with
additional contextual variables (e.g., home layout or luxury amenities).
Conclusion:
The negative coefficients for Sqft_Above and Bathrooms highlight the impact
of multicollinearity and interaction effects.
These variables may not independently contribute to price prediction and could be
reevaluated for inclusion in the model.
e. What if you just used the top 2 variables (sqft_living and grade) to predict
price? Is there a significant degradation in model performance? If not, why
do you think this happened?
Evaluating Model Performance with Top 2 Variables (sqft_living and grade)
Step 1: Model with sqft_living and grade
To evaluate performance:
1. Fit a new regression model using only sqft_living and grade as predictors.
2. Compare the R-squared and Adjusted R-squared values with the original model using all
five predictors.
3. Analyze whether there is a significant degradation in performance and explain why.
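The comparison in these steps can be sketched as follows. This is an illustration on synthetic data, assuming NumPy only; `fit_r2` is a helper written for this example, and `x3` is deliberately built to be nearly redundant with `x1` so that dropping it barely changes the fit.

```python
import numpy as np

def fit_r2(X, y):
    """Fit OLS with an intercept; return (R-squared, adjusted R-squared)."""
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    sse = np.sum((y - A @ coef) ** 2)          # unexplained variation
    sst = np.sum((y - y.mean()) ** 2)          # total variation
    r2 = 1 - sse / sst
    adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    return r2, adj

rng = np.random.default_rng(2)
n = 2000
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)
x3 = 0.9 * x1 + 0.1 * rng.normal(size=n)       # nearly redundant with x1
y = 2 * x1 + 1.5 * x2 + 0.05 * x3 + rng.normal(scale=2.0, size=n)

full = fit_r2(np.column_stack([x1, x2, x3]), y)
reduced = fit_r2(np.column_stack([x1, x2]), y)
print("full    R2=%.4f adj=%.4f" % full)
print("reduced R2=%.4f adj=%.4f" % reduced)
```

R-squared can never go up when a predictor is removed, so the question is only how much it falls. When the dropped variable mostly duplicates one that stays in the model, the drop is tiny.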
Observations:
1. R-squared Difference:
The R-squared value decreased slightly from 0.5442 to 0.5345 (a difference of
0.0097).
This indicates only a minor degradation in model performance when reducing the
predictors to the top 2 variables.
2. Adjusted R-squared Difference:
Adjusted R-squared, which accounts for the number of predictors, also decreased
slightly from 0.5442 to 0.5345.
This further confirms that the additional variables
(sqft_above, sqft_living15, bathrooms) contribute minimally to the model’s
explanatory power.
Explanation:
1. Dominance of Key Predictors:
Sqft_Living and Grade are the strongest predictors, as they explain most of the
variation in price. The other variables contribute relatively little to the model,
which is why their exclusion results in minimal degradation.
2. Multicollinearity Impact:
Variables like Sqft_Above and Bathrooms are correlated with Sqft_Living,
making their contributions redundant. This is evident from their smaller or
unexpected coefficients in the full model.
3. Practical Relevance:
Simplifying the model to include only the top 2 variables does not significantly
reduce its predictive power, making it a more practical choice for interpretation
and application.
Conclusion:
Using only sqft_living and grade to predict price results in no significant degradation in
model performance.
This happens because these two variables dominate the relationship with price, while the
other variables contribute marginally.
f. The coefficients for sqft_living and grade will be different from those in Q2b. In
simple, non-technical language, explain what the coefficient for sqft_living
tells you about the effect that sqft_living has on price. Recognize that the
interpretation of coefficients (and p-values) has to be done in the context of
what other variables are in the model.
Explanation of the Coefficient for Sqft_Living
Coefficient for Sqft_Living:
The coefficient for sqft_living is 245.42.
This means that for every additional square foot of living space, the price of the property
increases by approximately $245.42, assuming all other variables
(e.g., grade, sqft_above) remain constant.
Contextual Interpretation:
g. If you look at the residual plots, you may note a slight curvature. Create 2
new variables: sqft_living^2 and grade^2. Refit the model from Q2e with
these two variables added. Is there clear evidence of significant curvature
with both variables?
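One possible way to test for curvature in Python is sketched below. It assumes NumPy plus the standard library, and uses synthetic data in which only the first predictor truly has a squared effect. The p-values use a normal approximation to the t distribution, which is an assumption that is reasonable only because the sample is large.

```python
import math
import numpy as np

rng = np.random.default_rng(3)
n = 5000
x1 = rng.uniform(0, 10, n)          # stand-in for sqft_living (rescaled)
x2 = rng.uniform(1, 13, n)          # stand-in for grade
# Synthetic response with real curvature in x1 only.
y = 5 + 2 * x1 + 0.5 * x1**2 + 3 * x2 + rng.normal(scale=5.0, size=n)

names = ["const", "x1", "x2", "x1^2", "x2^2"]
A = np.column_stack([np.ones(n), x1, x2, x1**2, x2**2])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

resid = y - A @ coef
dof = n - A.shape[1]
sigma2 = resid @ resid / dof                    # residual variance
cov = sigma2 * np.linalg.inv(A.T @ A)           # coefficient covariance matrix
se = np.sqrt(np.diag(cov))
t = coef / se
# Two-sided p-value from the normal approximation to the t distribution.
p = np.array([math.erfc(abs(ti) / math.sqrt(2)) for ti in t])

for row in zip(names, coef, se, t, p):
    print("%-6s coef=%9.4f se=%8.4f t=%8.2f p=%.4f" % row)
```

The squared term for `x1` should come out clearly significant here because the curvature was built into the data; the squared term for `x2` has no true effect, so its p-value depends on the random draw.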
Conclusion
Is there evidence of significant curvature?
If the p-values for both sqft_living^2 and grade^2 are below 0.05, this provides
clear evidence of significant curvature, justifying their inclusion in the model.
You can explore other models using just these 5 predictors. For example, you could look at a
model with sqft_above and sqft_basement. If you do, then you would not include sqft_living
since it is perfectly correlated with the sum of the other two. Sqft_living15 reflects the size of
the 15 closest neighbours. Maybe we should look at the size of a house relative to its
neighbours, say rel15 = sqft_living/sqft_living15. It is statistically significant but the
improvement is no better than using sqft_living15 and both really don’t improve the quality of
predictions enough to notice.
a. Copy the residuals from Q2e and the full set of predictor variables into a new
sheet. You can delete sqft_living and grade, since they are already in our
model. Construct a correlation table. Print the table below.
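The idea behind this step can be sketched on synthetic data, assuming NumPy only. Here `x` is a predictor already in the model, `z` is an omitted predictor that really does affect the response, and `w` is an omitted predictor that does not; all three are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 3000
x = rng.normal(size=n)              # predictor already in the model
z = rng.normal(size=n)              # omitted predictor that matters
w = rng.normal(size=n)              # omitted predictor unrelated to the response
y = 1.0 + 2.0 * x + 1.5 * z + rng.normal(size=n)

A = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ coef

def corr(a, b):
    """Pearson correlation between two 1-D arrays."""
    return np.corrcoef(a, b)[0, 1]

print("corr(resid, x) = %.3f  (in the model, so essentially zero)" % corr(resid, x))
print("corr(resid, z) = %.3f  (omitted and relevant)" % corr(resid, z))
print("corr(resid, w) = %.3f  (omitted and irrelevant)" % corr(resid, w))
```

This is why sqft_living and grade can be dropped from the table: least-squares residuals are uncorrelated with every predictor already in the model, so only the remaining variables can show a signal.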
1. Residuals:
The table shows how the residuals (errors between actual and predicted prices) are
related to other predictor variables not included in the model
(sqft_living and grade were excluded as they are already part of the model).
2. Correlations:
Each value in the table indicates the strength and direction of the relationship
between two variables.
Positive Correlation: When one variable increases, the other tends to increase
(closer to +1).
Negative Correlation: When one variable increases, the other tends to decrease
(closer to -1).
Weak Correlation: Values close to 0 suggest little or no relationship.
Key Insights
1. Residuals:
If residuals are strongly correlated with any predictor variable, it indicates that the
model might be missing important information from that variable.
2. Predictor Variables:
Variables like sqft_above, sqft_basement, and sqft_living15 may show strong
correlations with each other, reflecting redundancy or overlap in their
contributions to the model.
Simplified Takeaway
The correlation table helps us understand:
How well the model captures relationships between predictors and price.
If there are any important variables the model may have overlooked.
Whether some predictors overlap too much, making them less useful in the model.
b. You should find that lat (latitude) has the strongest correlation, followed by
waterfront and view. Construct a scatter chart of residuals versus latitude.
Paste a picture below.
Key Observations
Pattern with Latitude:
The residuals show some clustering and variation with latitude.
Around latitude 47.5 to 47.7, residuals become more dispersed, suggesting the
model struggles more in these regions.
1. Correlation with Latitude:
Latitude has a strong correlation with price, as properties in specific latitudinal
bands (e.g., higher or lower) may belong to more expensive or less expensive
areas.
The presence of geographic patterns, such as proximity to water or urban areas,
affects the model's predictions.
2. Outliers:
Some points deviate significantly from zero (e.g., high positive residuals above
latitude 47.5). These may represent unusual properties, such as luxury homes or
those with unique features not captured by the model.
3. Geographic Context:
High-value properties, particularly in ZIP Code 98004 (an affluent area), could
influence this pattern. Latitude's correlation reflects its relationship with these
premium areas.
The scatter plot highlights geographic bias in the model, with latitude strongly
influencing residuals.
The model could potentially be improved by incorporating more location-specific
variables, like proximity to water or urban centers, to reduce residual errors.
c. Although these two variables are correlated, the pattern is not clear from the
chart. It looks like those in higher latitudes exhibit more variability in their
prices. Scatter charts for large data sets can hide patterns. Location is
usually an important variable. Construct a pivot table with latitude in rows,
longitude in columns and price in values. Summarize price with AVERAGE.
Group rows and columns using the Excel defaults with an increment (By) of
0.1. You may want to round the starting values to 47.1 for latitude and
-122.5 for longitude to make the row and column labels “nicer” to read. Paste
a picture of your pivot table below.
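The grouping logic behind this pivot table can be sketched in plain Python. The rows below are made up for illustration, and labelling each group by its lower edge is an assumption; Excel labels grouped ranges differently.

```python
import math
from collections import defaultdict

# A handful of made-up (lat, long, price) rows, NOT taken from the dataset.
rows = [
    (47.55, -122.23, 900_000),
    (47.58, -122.21, 1_100_000),
    (47.33, -122.28, 300_000),
    (47.36, -122.12, 350_000),
    (47.52, -122.37, 500_000),
]

LAT_START, LONG_START, STEP = 47.1, -122.5, 0.1

def bin_label(value, start):
    """Lower edge of the 0.1-wide bin that contains value."""
    index = math.floor((value - start) / STEP)
    return round(start + index * STEP, 1)

sums = defaultdict(float)
counts = defaultdict(int)
for lat, long_, price in rows:
    key = (bin_label(lat, LAT_START), bin_label(long_, LONG_START))
    sums[key] += price
    counts[key] += 1

# Average price per (latitude bin, longitude bin) cell.
pivot = {key: sums[key] / counts[key] for key in sums}
for (lat_bin, long_bin), avg in sorted(pivot.items()):
    print(f"lat {lat_bin:.1f}  long {long_bin:.1f}  average price {avg:,.0f}")
```

Cells with no properties simply do not appear in `pivot`, which corresponds to the empty cells in the Excel pivot table.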
Histogram Insights
1. X-Axis (Average Price):
Represents the average property prices grouped by latitude.
Values range from 300,000 to 1,000,000, indicating different average price
groups.
2. Y-Axis (Frequency):
Represents the number of latitude groups that fall into each average price range.
Most latitude groups fall into lower average price categories (300,000–400,000),
with fewer in the higher price ranges.
3. Pattern:
There is a skewed distribution, with most latitude groups having lower average
prices, while higher prices occur less frequently.
Empty cells indicate no data for that specific latitude-longitude combination, likely
due to a lack of properties in those groups.
Combined Analysis
Histogram:
The histogram provides an overview of price distribution by latitude groups,
showing that lower prices are more common.
Pivot Table:
The pivot table offers detailed insights into how prices vary based on location
(latitude and longitude).
It highlights specific latitude-longitude groups with higher or lower average
prices, which can be useful for location-specific analysis.
Overall Analysis
1. Latitude Influence:
Prices are concentrated in lower latitude groups (e.g., 47.3, 47.4), suggesting that
areas in these regions may have more demand or valuable properties.
2. Longitude Influence:
Certain longitude groups show higher prices, likely corresponding to specific
neighborhoods or features like proximity to water or urban centers.
3. Regional Clusters:
The histogram and pivot table together show that higher property prices are
clustered in specific latitude-longitude regions, reflecting local market dynamics.
Key Takeaway
The patterns in the histogram and pivot table reflect the uneven distribution of property prices,
heavily influenced by geographic location. This suggests that latitude and longitude are
important predictors in modeling property prices, especially in regions with clustered high-value
properties.
e. Fit a new model for price using the variables from Q2g as well as latitude,
latitude^2, longitude, longitude^2, latitude*longitude, waterfront, view and
condition. You should have 12 variables. Paste a picture of the summary
output below.
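Before fitting, it may help to confirm that the design really has 12 columns. The sketch below builds them from synthetic raw columns, assuming NumPy only; the values are random placeholders and the dictionary keys are labels chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 8
# Synthetic stand-ins for the raw columns (values are illustrative only).
raw = {
    "sqft_living": rng.uniform(800, 4000, n),
    "grade": rng.integers(4, 12, n).astype(float),
    "lat": rng.uniform(47.2, 47.8, n),
    "long": rng.uniform(-122.5, -121.8, n),
    "waterfront": rng.integers(0, 2, n).astype(float),
    "view": rng.integers(0, 5, n).astype(float),
    "condition": rng.integers(1, 6, n).astype(float),
}

# The four Q2g variables plus the eight location and quality terms.
features = {
    "sqft_living": raw["sqft_living"],
    "grade": raw["grade"],
    "sqft_living^2": raw["sqft_living"] ** 2,
    "grade^2": raw["grade"] ** 2,
    "lat": raw["lat"],
    "lat^2": raw["lat"] ** 2,
    "long": raw["long"],
    "long^2": raw["long"] ** 2,
    "lat*long": raw["lat"] * raw["long"],
    "waterfront": raw["waterfront"],
    "view": raw["view"],
    "condition": raw["condition"],
}

X = np.column_stack(list(features.values()))
print(X.shape[1], "predictor columns:", ", ".join(features))
```

The resulting matrix `X` can then be passed to whatever regression routine is being used, with an intercept added.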
Key Observations
1. Centering Around Zero:
The residuals are tightly clustered around 0, which indicates that the model
predicts prices accurately for most properties.
This is a desirable property of a well-fitted regression model.
2. Skewness and Outliers:
The residuals are slightly asymmetric, with a few extreme values extending
towards both positive and negative ranges.
Positive residuals indicate properties for which the model underpredicted the
price, while negative residuals indicate overprediction.
3. Frequency Distribution:
The majority of residuals have a frequency between 10,000 and 20,000,
suggesting that the model handles most predictions well within a narrow range of
errors.
4. Spread:
Most residuals fall within a small range, with very few extending beyond ±1
million. This indicates that the model errors are relatively small for most
predictions.
Potential Improvements:
1. Investigate Outliers:
Analyze properties with extreme residuals to identify characteristics the model
might be missing (e.g., luxury features, unusual locations).
2. Model Refinements:
The histogram suggests the model is generally accurate but may benefit from
additional location-based or interaction terms to reduce error for outliers.
The histogram indicates that the model performs well for most properties, with errors centered
around 0. However, the presence of a few outliers suggests room for improvement in handling
properties with extreme or unique characteristics.
f. In many cities, prices are higher in the downtown and lower in suburbs. In
that case, we should see negative coefficients for latitude^2 and longitude^2.
But since we included an interaction term, latitude*longitude, interpretation
is more difficult. Let us consider a different variable for interpretation,
waterfront. How would you explain the value of the waterfront coefficient to
the typical homeowner?
Key Considerations:
1. Coefficients Are Averages:
4. Although the R-square has been creeping up as we make the model more complex, the
standard error is still very large (approximately $190,000). This is mainly due to the
difficulty in predicting the price of very expensive homes. Let us look only at homes
under one million $. This represents 20,121 of the 21,613 homes (93%). Sort the data
used in Q3e. Delete the cases with price of $1,000,000 or more. Your sheet should end at
row 20,122 (header row + 20,121 rows of data).
a. Refit the model. Paste a picture of the summary output below.
Key Observations
1. Symmetry Around Zero:
Insights
1. Performance on Typical Homes:
The model performs well for most homes, particularly those with prices closer to
the mean or typical price range.
2. Difficulty with Outliers:
The residuals at the far ends of the distribution indicate that the model has
difficulty predicting prices for extremely expensive or unusual properties.
3. Standard Error Reduction:
Compared to the original model that included homes priced over $1,000,000,
filtering out the most expensive homes likely reduced the standard error,
improving overall model accuracy for typical properties.
Improvements to Consider
1. Analyze Outliers:
Investigate the properties with large residuals to identify any missing predictors or
unique characteristics (e.g., luxury features, location).
2. Non-Linear Terms:
Consider additional transformations or non-linear terms for variables, especially
for features influencing high-value homes.
3. Segment Models:
Separate models for different price ranges (e.g., typical homes vs. luxury homes)
could improve prediction accuracy for outliers.
Summary
The histogram indicates that the refitted model predicts property prices well for most homes
under $1,000,000. However, there are a few outliers where the model underpredicts or
overpredicts prices by a large margin. Further refinement could improve accuracy, especially for
these outlier cases.
b. Is the model “better” or “worse” than the one in Q3e? How are you
measuring performance?
3. Residual Distribution:
Q3e Model: Likely shows a wider range of residuals due to extreme values.
Refit Model: Residuals are more tightly distributed around zero, as seen in the
histogram, indicating better predictions for typical homes.
Observation: A tighter residual distribution in the refit model suggests improved accuracy for
homes under $1,000,000.
4. Interpretability:
The refit model is simpler and more interpretable for the majority of homeowners
since it excludes outliers like luxury homes that are not representative of the
broader market.
Better or Worse?:
The refit model is better than the Q3e model for predicting homes under
$1,000,000 because it focuses on this range and reduces prediction error.
However, the Q3e model may be better for understanding the entire housing
market, including luxury homes.
Performance Measure:
The comparison relies primarily on the standard error of residuals and
the distribution of residuals. The refit model’s lower standard error and tighter
residual distribution make it better for its specific target market.
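The standard error comparison can be sketched as follows, assuming NumPy only. The data are synthetic, with noise that grows with home size to mimic expensive homes being harder to predict; `standard_error` is a helper written for this example.

```python
import numpy as np

def standard_error(X, y):
    """Residual standard error: sqrt(SSE / (n - k - 1)) for OLS with an intercept."""
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(np.sqrt(resid @ resid / (n - k - 1)))

rng = np.random.default_rng(6)
n = 4000
size = rng.uniform(800, 5000, n)
# Synthetic prices: noise standard deviation is proportional to size.
price = 50_000 + 200 * size + rng.normal(size=n) * 40 * size
X = size.reshape(-1, 1)

under = price < 1_000_000           # keep only homes under one million dollars
se_all = standard_error(X, price)
se_under = standard_error(X[under], price[under])
print(f"all homes:      n={n}, standard error={se_all:,.0f}")
print(f"under $1M only: n={under.sum()}, standard error={se_under:,.0f}")
```

Part of the drop in standard error is mechanical, because the hardest cases were removed by filtering on the very variable being predicted. The two standard errors describe different populations, so the comparison should be read with that in mind.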
c. The new model uses the same variables as before but with 7% less data. You
might expect the model to remain unchanged and the different performance
was just because you removed the difficult cases. Are there any significant
changes to the model? If so, give some examples.
What to Compare
1. Coefficients:
Are there noticeable changes in the magnitude or direction of the coefficients
(e.g., sign changes)?
A significant change in coefficients suggests that the removed high-priced homes
had a disproportionate influence on those variables.
2. P-Values:
Did the significance of any variables change?
A variable that was significant in the Q3e model might no longer be significant in
the refit model, or vice versa.
Expected Changes
1. Coefficients for Latitude and Longitude:
High-value homes might have unique geographic locations (e.g., near waterfronts
or in exclusive neighborhoods). Removing these homes could reduce the impact
of location variables (latitude, longitude, and their interactions).
2. Waterfront and View:
These variables are likely highly influential for luxury homes. Removing high-
value cases could decrease their coefficients or even their significance.
Conclusion
Changes in the Model:
Variables like waterfront, view, and location-based terms (latitude, longitude) are
likely to show the most significant changes in coefficients or significance.
Variables related to physical attributes of homes (e.g., sqft_living, grade) might
remain relatively stable.
Examples of Significant Changes:
Latitude Coefficient: Decreases significantly, as high-priced homes in exclusive
areas influence this variable.
Waterfront Coefficient: Reduced dramatically, as waterfront properties tend to be
high-value homes.
Standard Error of Residuals: Decreases in the refit model, showing better fit for
the majority of homes.
d. This is a big data set. In this last question we have simply done some “small”
cleaning. What can you say about the effects of “cleaning” data?
5. Improved Interpretability:
Cleaning the data simplifies interpretation by focusing on the majority of the
market. For example:
The refit model excludes features or relationships that are only relevant to
luxury properties, making the results more meaningful for typical
homeowners.
Conclusion
Cleaning data in a big dataset:
1. Improves model accuracy and interpretability for the targeted group.
2. Reduces noise and variability, leading to more stable results.
3. Simplifies the model by focusing on the majority of the data.
However, cleaning can also exclude valuable information about outliers or specialized cases,
which may limit the model’s applicability for those groups. The key is to clean data in a way that
aligns with the specific goals of the analysis.
5. This assignment is long enough. But I would like you to try repeating some pieces of
analysis in Python. I would like you to use AI to generate the Python code. If your AI
prompt doesn’t work on your first attempt, please print this first attempt and
comments on why it did not give you what you expected. You don’t need to print
the final corrected result.
a. Refit the model from Q2g. Tell me the name of the AI model you used
(ChatGPT, Co-pilot, Gemini,…). Print a copy of the prompt you used. Print a
copy of the summary output.
Comparison of Models
1. Performance:
Model 2 performs best (highest R-squared), capturing more variance due to the
inclusion of the interaction term.
Model 3 has a reduced R-squared because it excludes high-value homes, but it’s
more accurate for the mid-range market.
2. Impact of Interaction Terms:
Adding interaction terms (sqft_living*grade) significantly enhances the model by
capturing the combined effect of living area and grade on price.
3. Focused vs. Generalized Models:
Models 1 and 2 generalize to all homes, while Model 3 focuses on mid-market
homes, leading to better interpretability for typical cases.
b. Construct a correlation table like in Q3a and a residual chart like in Q3b.
Print the prompt, code and output below.
Residual Chart:
The residual chart plots the residuals (differences between predicted and actual prices)
against the latitude of properties.
Key observations:
Residuals are scattered around the horizontal line at 0, indicating no obvious
pattern.
A slight curvature or clustering might suggest potential non-linear relationships or
omitted variables.
1. Correlation Table
Purpose:
The correlation table quantifies the linear relationship between price and other
predictor variables (e.g., sqft_living, grade).
High correlations suggest a strong association with price, which can help identify
the most influential predictors.
Key Observations:
Positive Correlations: Variables like sqft_living and grade are highly positively
correlated with price. This suggests that larger homes with higher grades tend to
be more expensive.
Potential Multicollinearity: High correlations between predictors
(e.g., sqft_living and sqft_above) may indicate multicollinearity, which could
destabilize the regression model.