Task 1 – Data Analytics in Python
STU218659
Lee Braiden
Investigating the Manchester Housing Market
The main goal of this report is to examine the Manchester Housing dataset and offer
insights that support informed decisions. The analysis follows the CRISP-DM (Cross-Industry
Standard Process for Data Mining) framework, which comprises six stages: Business
Understanding, Data Understanding, Data Preparation, Modeling, Evaluation and Deployment.
The report covers statistical analysis, an application of the Central Limit Theorem and
the use of Python for data analysis.
The main objective is to identify the factors that affect property values in Manchester,
looking specifically at features such as floor space, construction year, proximity to water and
available amenities. The study aims to inform pricing strategy, real estate development
choices and potential investment opportunities.
Data Understanding
Dataset Overview
• Price
• Waterfront status
• Floor Space
• Year Built
• Bedrooms
• Bathrooms
• Location
• Property Type
• Condition
• Lot Size
• Amenities
First, we loaded the dataset and displayed the first 10 rows for initial inspection.
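A minimal sketch of this loading step is given below. It assumes the data is stored as a CSV file; the file name in the commented example call is hypothetical, as the report does not state it.

```python
import pandas as pd

def load_and_preview(csv_source, n_rows=10):
    """Read a CSV source into a DataFrame and return it with its first n_rows rows."""
    df = pd.read_csv(csv_source)
    return df, df.head(n_rows)

# Example call (hypothetical file name, not taken from the report):
# df, preview = load_and_preview("manchester_housing.csv")
# print(preview)
# print(df.dtypes)   # column types, useful before any cleaning
```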
Descriptive Statistics
In this study we analyzed descriptive statistics for waterfront homes to gain insight into their
characteristics and variation. The findings showed that waterfront properties generally
command higher prices, offer more spacious living areas and come with a greater range of
amenities than non-waterfront properties.
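A comparison of this kind might be produced with a grouped summary, as sketched below. The column names ("Waterfront", "Price", "Floor Space") are assumptions based on the dataset overview, and the small frame is made up purely to show the call; it is not the Manchester data.

```python
import pandas as pd

def describe_by_group(df, group_col, value_cols):
    """Return count, mean, median and standard deviation of value_cols per group."""
    return df.groupby(group_col)[value_cols].agg(["count", "mean", "median", "std"])

# Made-up illustrative rows (NOT the real dataset); 1 = waterfront, 0 = not.
demo = pd.DataFrame({
    "Waterfront": [1, 1, 0, 0, 0],
    "Price": [300.0, 500.0, 200.0, 250.0, 300.0],
    "Floor Space": [120.0, 180.0, 90.0, 100.0, 110.0],
})
summary = describe_by_group(demo, "Waterfront", ["Price", "Floor Space"])
print(summary)
```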
Data Preparation
Finding and filling in missing values accurately is important for reliable analysis. We replaced
missing values in categorical variables with the most frequently occurring value (the mode) and
verified and adjusted data types as needed. This ensured that all data points were usable and
consistent for the analysis.
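The mode-based replacement described above can be sketched as follows. The helper name and the column names are hypothetical, and the small frame is invented for illustration only.

```python
import pandas as pd

def fill_categorical_with_mode(df, columns):
    """Return a copy of df in which missing values in each listed column
    are replaced by that column's most frequent value (its mode)."""
    out = df.copy()
    for col in columns:
        modes = out[col].mode(dropna=True)
        if not modes.empty:  # a column that is entirely missing has no mode
            out[col] = out[col].fillna(modes.iloc[0])
    return out

# Invented example rows (NOT the real dataset).
demo = pd.DataFrame({
    "Condition": ["Good", None, "Good", "Fair"],
    "Property Type": ["Flat", "House", None, "House"],
})
cleaned = fill_categorical_with_mode(demo, ["Condition", "Property Type"])
print(cleaned)
```

Working on a copy keeps the original frame available, which is convenient when comparing results before and after preprocessing.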
A two-sample t-test was performed to compare prices between waterfront and non-waterfront
properties. The test gave a t-statistic of 0.210 and a p-value of 0.836. Because the p-value is
far above the conventional 0.05 threshold, the test provides no evidence of a statistically
significant difference in mean price between waterfront and non-waterfront properties.
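A test of this kind could be run with SciPy as sketched below. The report does not say which variant was used, so the choice of Welch's version (equal_var=False) and the 0.05 significance level are assumptions, and the helper name is hypothetical.

```python
from scipy import stats

def compare_group_means(group_a, group_b, alpha=0.05):
    """Two-sample t-test (Welch's variant, an assumption here).
    Returns (t_statistic, p_value, significant_at_alpha)."""
    t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
    return float(t_stat), float(p_value), bool(p_value < alpha)

# Example call, assuming hypothetical column names "Waterfront" and "Price":
# waterfront = df.loc[df["Waterfront"] == 1, "Price"]
# other = df.loc[df["Waterfront"] == 0, "Price"]
# t_stat, p_value, significant = compare_group_means(waterfront, other)
```

With the reported p-value of 0.836, the third returned value would be False at the 0.05 level.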
To illustrate the Central Limit Theorem, we drew repeated random samples from the dataset and
plotted the averages of these samples. The distribution of sample averages resembled a normal
distribution. This illustrates that, as the sample size grows, the distribution of the sample
mean price approaches a normal distribution, regardless of whether the original price
distribution is normal.
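The sampling procedure can be sketched as follows. The sample size (30), the number of samples (1000) and sampling with replacement are assumptions, since the report does not state them; the skewed synthetic prices only stand in for the real Price column.

```python
import numpy as np

def sample_means_of(values, sample_size=30, n_samples=1000, seed=0):
    """Draw n_samples random samples (with replacement) of sample_size
    observations each and return the mean of every sample."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    draws = rng.choice(values, size=(n_samples, sample_size), replace=True)
    return draws.mean(axis=1)

# Synthetic right-skewed "prices" (NOT the real data), used only for illustration.
rng = np.random.default_rng(1)
prices = rng.lognormal(mean=12.0, sigma=0.5, size=5000)

sample_means = sample_means_of(prices, sample_size=30, n_samples=1000)
print("population mean:", prices.mean(), "mean of sample means:", sample_means.mean())
print("population std:", prices.std(), "std of sample means:", sample_means.std())
```

The sample means cluster around the population mean with a much smaller spread than the raw prices, which is the behaviour the theorem describes.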
Modeling and Analysis
Correlation Analysis
Correlation matrices were computed before and after data preprocessing to understand
relationships between numeric variables. Key correlations identified include:
Heatmaps were used to visualize these correlations, highlighting the relationships between
different property attributes.
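One way to compute such correlations is sketched below. The helper name and the column names are hypothetical, Pearson correlation is assumed because the report does not name the method, and the rows are invented for illustration.

```python
import pandas as pd

def price_correlations(df, target="Price"):
    """Pearson correlation of every numeric column with the target column,
    ordered from strongest to weakest by absolute value."""
    corr = df.select_dtypes(include="number").corr()
    ranked = corr[target].drop(target)
    order = ranked.abs().sort_values(ascending=False).index
    return ranked.reindex(order)

# Invented example rows (NOT the real dataset).
demo = pd.DataFrame({
    "Price": [100.0, 200.0, 300.0, 400.0],
    "Floor Space": [50.0, 100.0, 150.0, 200.0],   # rises in step with Price
    "Year Built": [2000.0, 1990.0, 2010.0, 1995.0],
    "Location": ["A", "B", "C", "D"],              # non-numeric, so excluded
})
ranked = price_correlations(demo)
print(ranked)
```

The full matrix returned by corr() could then be passed to seaborn's heatmap function to draw the kind of heatmap mentioned above.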
Visualizations
• Distribution of Floor Space: This histogram showed the spread and central tendency of
floor space across properties.
• Year Built vs. Price: A scatter plot revealed a positive trend, indicating that newer
properties tend to be priced higher.
• Floor Space vs. Price: A scatter plot demonstrated a clear positive relationship,
suggesting that larger properties command higher prices.
• Waterfront vs. Price: A box plot showed that waterfront properties generally have higher
median prices, though the variability within each category was considerable.
Evaluation
• Waterfront status has no statistically significant effect on Price, as indicated by the
T-test results (p-value of 0.836).
These findings suggest that while certain factors like floor space and the number of bedrooms
significantly influence property prices, others like the year built and waterfront status have less
impact.
Recommendations
1. Focus on Floor Space and Amenities: Properties with larger floor space and better
amenities should be priced higher, as these factors significantly influence property prices.
2. Year Built Consideration: While newer properties are slightly more valuable, this factor
is less significant compared to floor space and amenities.
The investigation of the Manchester Housing dataset has provided information about the
factors affecting property prices. Using the CRISP-DM framework, we studied the data,
applied statistical techniques and drew conclusions to guide strategic choices.
Appendix
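import matplotlib.pyplot as plt
import seaborn as sns

# sample_means: sequence of sample means of Price, computed beforehand
plt.figure(figsize=(10, 6))
sns.histplot(sample_means, kde=True)  # histogram with a kernel density estimate overlay
plt.title('Sampling Distribution of the Sample Mean [Central Limit Theorem]')
plt.xlabel('Sample Mean of Price')
plt.ylabel('Frequency')
plt.show()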