Task 1– Data Analytics in Python

This report investigates the Manchester Housing dataset to identify factors influencing property values using the CRISP DM framework. Key findings indicate that floor space and the number of bedrooms significantly impact prices, while waterfront status has a minor effect. The analysis recommends focusing on larger properties with better amenities for pricing strategies and investment decisions.

COM7024

MSc Data Science

Programming for Data Analytics

Investigating the Manchester Housing Market

STU218659

Lee Braiden
Investigating the Manchester Housing Market
The main goal of this report is to examine the Manchester Housing dataset and offer
insights that support informed decisions. The analysis follows the CRISP-DM (Cross-Industry
Standard Process for Data Mining) framework, which comprises six stages: Business
Understanding, Data Understanding, Data Preparation, Modeling, Evaluation and Deployment.
The report includes statistical tests, a demonstration of the Central Limit Theorem and
the use of Python for data analysis.

Exploring Business Factors

The main objective is to pinpoint the factors that affect property values in Manchester,
looking specifically at features such as floor space (square footage), construction year,
proximity to water and available amenities. The study aims to inform pricing strategies,
real estate development choices and potential investment opportunities.

Data Understanding

Dataset Overview

The dataset contains various attributes of properties in Manchester, including:

• Price

• Waterfront status

• Floor Space

• Year Built

• Bedrooms

• Bathrooms

• Location

• Property Type

• Condition

• Lot Size

• Amenities

First, we loaded the dataset and displayed the first 10 rows for initial inspection.

Descriptive Statistics

We computed descriptive statistics for waterfront homes to understand their
characteristics and variation. The findings indicated that waterfront properties generally
command higher prices, offer spacious living areas and come with a greater range of
amenities than non-waterfront properties.
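
A side-by-side comparison of the two groups can be sketched with a pandas group-by. This is a minimal illustration only: the six rows are made-up values, not records from the Manchester dataset, and the column names `Waterfront`, `Price` and `Floor Space` are assumed to match those used in the Appendix.

```python
import pandas as pd

# Made-up illustrative rows; NOT taken from the Manchester Housing dataset.
toy = pd.DataFrame({
    "Waterfront": [1, 0, 0, 1, 0, 0],
    "Price": [400000, 200000, 300000, 500000, 250000, 250000],
    "Floor Space": [1800, 900, 1400, 2200, 1100, 1200],
})

# One row per group (0 = non-waterfront, 1 = waterfront),
# with mean, median and standard deviation for each numeric column.
summary = toy.groupby("Waterfront")[["Price", "Floor Space"]].agg(
    ["mean", "median", "std"]
)
print(summary)
```

Showing both groups in one table makes the comparison explicit, whereas calling `describe()` on the waterfront subset alone gives no baseline to compare against.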

Data Preparation

Data Cleaning and Transformation

Missing values must be identified and filled before analysis. We replaced missing values in
categorical variables with the most frequently occurring value (the mode), and we checked
and converted data types where needed. This ensured that all data points were usable and
consistent for the analysis.
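
The mode-based imputation described above can be sketched as follows. The helper name `impute_with_mode` and the toy values are hypothetical and are used only for illustration; the Appendix applies the same idea directly to the `Amenities` column.

```python
import pandas as pd


def impute_with_mode(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Return a copy of df in which missing values in the given columns
    are replaced by each column's most frequent value (its mode)."""
    result = df.copy()
    for col in columns:
        modes = result[col].mode(dropna=True)
        if not modes.empty:  # skip columns that are entirely missing
            result[col] = result[col].fillna(modes.iloc[0])
    return result


# Made-up illustrative rows; NOT taken from the Manchester Housing dataset.
toy = pd.DataFrame({
    "Amenities": ["Garden", None, "Garden", "Garage"],
    "Price": [250000.0, 310000.0, 275000.0, 420000.0],
})

cleaned = impute_with_mode(toy, ["Amenities"])
print(cleaned["Amenities"].tolist())
print("Missing after imputation:", int(cleaned["Amenities"].isna().sum()))
```

Working on a copy leaves the original frame untouched, so the correlation matrix can be computed both before and after preprocessing.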

Statistical Test: T-test

An independent two-sample t-test was performed to compare the prices of waterfront and
non-waterfront properties. The test returned a t-statistic of 0.210 and a p-value of 0.836.
Because the p-value is far above the conventional 0.05 threshold, the test provides no
evidence of a statistically significant difference in mean price between waterfront and
non-waterfront properties.
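
The decision rule behind this reading of the p-value can be sketched as below. The helper `compare_means`, the 0.05 threshold `alpha` and the synthetic price samples are assumptions made for illustration; they do not reproduce the report's figures.

```python
import numpy as np
from scipy import stats


def compare_means(a, b, alpha=0.05):
    """Run an independent two-sample t-test and report whether the
    difference in means is significant at the chosen alpha level."""
    t_stat, p_val = stats.ttest_ind(a, b)  # default: equal variances assumed
    return float(t_stat), float(p_val), bool(p_val < alpha)


# Synthetic prices drawn from the SAME distribution for both groups,
# so any observed difference in means is due to sampling noise alone.
rng = np.random.default_rng(0)
waterfront = rng.normal(loc=300_000, scale=80_000, size=40)
non_waterfront = rng.normal(loc=300_000, scale=80_000, size=400)

t_stat, p_val, significant = compare_means(waterfront, non_waterfront)
print(f"t = {t_stat:.3f}, p = {p_val:.3f}, significant at 0.05: {significant}")
```

If the two groups are suspected to have unequal variances, passing `equal_var=False` to `stats.ttest_ind` runs Welch's t-test instead.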

Central Limit Theorem Demonstration

To illustrate the Central Limit Theorem, we drew 1,000 random samples of 30 prices each
(with replacement) from the dataset and plotted the means of these samples. The distribution
of the sample means resembled a normal distribution. This illustrates that, as the sample
size grows, the distribution of the sample mean approaches a normal distribution, regardless
of whether the original price distribution is normal.
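
The theorem also predicts how tightly the sample means cluster: their standard deviation should be close to the population standard deviation divided by the square root of the sample size. The sketch below checks this on a synthetic right-skewed population; the exponential distribution, its scale and the helper `sample_means` are assumptions for illustration, not the Manchester prices.

```python
import numpy as np


def sample_means(population, sample_size, n_samples, rng):
    """Draw n_samples samples (with replacement) of the given size
    and return the mean of each sample."""
    draws = rng.choice(population, size=(n_samples, sample_size), replace=True)
    return draws.mean(axis=1)


rng = np.random.default_rng(42)
# Synthetic, strongly right-skewed "prices" (clearly not normal).
population = rng.exponential(scale=250_000, size=10_000)

for n in (5, 30, 200):
    means = sample_means(population, n, 1000, rng)
    predicted_spread = population.std() / np.sqrt(n)
    print(
        f"n={n:>3}: std of sample means = {means.std():,.0f}, "
        f"sigma/sqrt(n) = {predicted_spread:,.0f}"
    )
```

The spread of the sample means shrinks as the sample size increases, which is why larger samples give more stable estimates of the average price.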
Modeling and Analysis

Correlation Analysis

Correlation matrices were computed before and after data preprocessing to understand
relationships between numeric variables. Key correlations identified include:

• A moderate positive correlation (0.390) between Floor Space and Price.

• A minor correlation (0.094) between Year Built and Price.

• A minor correlation (0.045) between Waterfront status and Price.

Heatmaps were used to visualize these correlations, highlighting the relationships between
different property attributes.
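
The coefficients above are Pearson correlations, which is the default computed by pandas `corr()`. As a minimal sketch of what that number measures, the function below computes it from its definition and compares it with NumPy's built-in result. The helper `pearson_r` and the synthetic floor-space and price values are assumptions for illustration only.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation: covariance of x and y divided by the
    product of their standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_c = x - x.mean()  # centre each variable on its mean
    y_c = y - y.mean()
    return float((x_c * y_c).sum() / np.sqrt((x_c**2).sum() * (y_c**2).sum()))


# Synthetic data: price depends partly on floor space, plus a lot of noise.
rng = np.random.default_rng(7)
floor_space = rng.uniform(500, 3000, size=500)
price = 100 * floor_space + rng.normal(0, 150_000, size=500)

r_manual = pearson_r(floor_space, price)
r_numpy = np.corrcoef(floor_space, price)[0, 1]
print(f"manual r = {r_manual:.3f}, numpy r = {r_numpy:.3f}")
```

A value near 0 indicates little linear association and a value near 1 indicates a strong positive one, so a coefficient such as 0.390 sits in the moderate range.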

Visualizations

Several plots were created to visualize relationships between variables:

• Distribution of Floor Space: This histogram showed the spread and central tendency of
floor space across properties.

• Year Built vs. Price: A scatter plot revealed a weak positive trend (correlation 0.094),
indicating that newer properties tend to be priced slightly higher.

• Floor Space vs. Price: A scatter plot demonstrated a clear positive relationship,
suggesting that larger properties command higher prices.

• Waterfront vs. Price: A box plot showed that waterfront properties generally have higher
median prices, though the variability within each category was considerable.

Evaluation

The analysis revealed key insights:

• There is a moderate positive correlation between Floor Space and Price.

• Bedrooms and Bathrooms have strong positive correlations with Price.

• Waterfront status has no statistically significant effect on Price, as indicated by the
t-test results (p = 0.836).

These findings suggest that while certain factors like floor space and the number of bedrooms
significantly influence property prices, others like the year built and waterfront status have less
impact.

Recommendations

1. Focus on Floor Space and Amenities: Properties with larger floor space and better
amenities should be priced higher, as these factors significantly influence property prices.

2. Year Built Consideration: While newer properties are slightly more valuable, this factor
is less significant compared to floor space and amenities.

3. Investment in Non-Waterfront Properties: Given the minor price difference between
waterfront and non-waterfront properties, investing in well-located non-waterfront
properties with good amenities might be more cost-effective.

This investigation of the Manchester Housing dataset has provided information about the
factors affecting property prices. By following the CRISP-DM framework, we systematically
examined the data, applied statistical techniques and drew conclusions to guide strategic
decisions.
Appendix

# Importing required libraries


import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy import stats

# Path of Manchester Housing dataset


file_path = r'C:\Users\Administrator\Desktop\DataAnalytics\manchester_housing_data.csv'
data = pd.read_csv(file_path)

# Display the first few rows of the dataset


print("First 10 rows of the dataset:")
print(data.head(10))

# Descriptive statistics for waterfront properties


print("statistics for waterfront properties:")
waterfront_properties = data[data['Waterfront'] == 1]
print(waterfront_properties.describe())

# Graph the distribution of floor space


plt.figure(figsize=(10, 6))
sns.histplot(data['Floor Space'], kde=True)
plt.title('Distribution of Floor Space')
plt.xlabel('Floor Space (sq ft)')
plt.ylabel('Frequency')
plt.show()

# Correlation matrix for numeric columns


print("\nCorrelation matrix for numeric columns:")
numeric_cols = data.select_dtypes(include=[np.number])
correlation_matrix = numeric_cols.corr()
print(correlation_matrix)

# Visualize the correlation matrix


plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

# Scatter plot for Year Built vs. Price


plt.figure(figsize=(10, 6))
sns.scatterplot(x='Year Built', y='Price', data=data)
plt.title('Year Built vs. Price')
plt.xlabel('Year Built')
plt.ylabel('Price')
plt.show()

# Scatter plot for Floor Space vs. Price
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Floor Space', y='Price', data=data)
plt.title('Floor Space vs. Price')
plt.xlabel('Floor Space (sq ft)')
plt.ylabel('Price')
plt.show()

# Box plot for Waterfront vs. Price


plt.figure(figsize=(10, 6))
sns.boxplot(x='Waterfront', y='Price', data=data)
plt.title('Waterfront vs. Price')
plt.xlabel('Waterfront')
plt.ylabel('Price')
plt.show()

# Correlation between Floor Space and Price


correlation_floor_space_price = data['Floor Space'].corr(data['Price'])
print(f"\nCorrelation between Floor Space and Price: {correlation_floor_space_price:.3f}")

# Correlation between Year Built and Price


correlation_year_price = data['Year Built'].corr(data['Price'])
print(f"Correlation between Year Built and Price: {correlation_year_price:.3f}")

# Central Limit Theorem


sample_means = []
for _ in range(1000):
    # Draw 30 prices with replacement and record the sample mean
    sample = data['Price'].sample(30, replace=True)
    sample_means.append(sample.mean())

plt.figure(figsize=(10, 6))
sns.histplot(sample_means, kde=True)
plt.title('Sampling Distribution of the Sample Mean [Central Limit Theorem]')
plt.xlabel('Sample Mean of Price')
plt.ylabel('Frequency')
plt.show()

# T-test (Statistical test) to compare prices of waterfront vs. non-waterfront properties


print("\nPerforming T-test to compare prices of waterfront vs. non-waterfront properties:")
waterfront_prices = data[data['Waterfront'] == 1]['Price']
non_waterfront_prices = data[data['Waterfront'] == 0]['Price']

t_stat, p_val = stats.ttest_ind(waterfront_prices, non_waterfront_prices)


print(f"Results: t-statistic = {t_stat:.3f}, p-value = {p_val:.3f}")

# Identifying missing values in data
print("\nIdentifying missing values in the dataset:")
missing_values = data.isnull().sum()
print("Missing Values in Dataset:\n", missing_values)

# Impute missing values in the categorical Amenities column with its mode
data['Amenities'] = data['Amenities'].fillna(data['Amenities'].mode()[0])

# Checking data types and converting them if necessary


data['Price'] = data['Price'].astype(float)
data['Waterfront'] = data['Waterfront'].astype(int)
data['Floor Space'] = data['Floor Space'].astype(float)
data['Year Built'] = data['Year Built'].astype(int)

# To use only numeric columns for correlation


numeric_cols_post = data.select_dtypes(include=[np.number])
correlation_matrix_post = numeric_cols_post.corr()
print("\nCorrelation Matrix after Preprocessing:\n", correlation_matrix_post)

# Visualize the updated correlation matrix


plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix_post, annot=True, cmap='coolwarm')
plt.title('Updated Correlation Matrix')
plt.show()

Output (in sequence)
