House Prices Analysis - Final Assessment

The document outlines a comprehensive approach to analyzing a real estate dataset with around 1460 data points, focusing on data manipulation, querying, visualization, and machine learning for price prediction. Key tasks include data preprocessing, feature engineering, and building various models such as Linear Regression, Decision Tree, and Random Forest to predict house prices. Additionally, it suggests using techniques like K-Means clustering, PCA, and SHAP for deeper insights and improved model performance.

Uploaded by i23nikunjm

House Prices Dataset - Data Manipulation, Querying, Visualization and Additional Tasks

The test and train datasets each contain around 1460 data points. For the ML tasks, merge the two datasets before starting.

Scenario:

You are working with a real estate dataset that contains various features such as:

LotArea: Size of the lot in square feet.

BedroomAbvGr: Total number of bedrooms in the house above grade.

NumberOfBathrooms: Total number of bathrooms (sum of BsmtFullBath, BsmtHalfBath, FullBath, and HalfBath).

YearBuilt: Year the house was built.

Neighborhood: Categorical feature indicating the area where the house is located.

GarageCars: Number of cars the garage can hold.

SalePrice: The actual price at which the house was sold.
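
NumberOfBathrooms is described as a sum of four columns rather than a column that already exists, so it presumably has to be derived first. A minimal sketch, assuming the four bath columns are present and that a missing count can be treated as 0 (the sample values are made up):

```python
import pandas as pd

# Tiny stand-in for the real data; the values are made up for illustration.
df = pd.DataFrame({
    "BsmtFullBath": [1, 0, 1],
    "BsmtHalfBath": [0, 0, 1],
    "FullBath": [2, 1, 2],
    "HalfBath": [1, 0, 0],
})

bath_cols = ["BsmtFullBath", "BsmtHalfBath", "FullBath", "HalfBath"]

# Plain sum of the four counts, as the feature description states.
# Treating missing counts as 0 is an assumption, not part of the task text.
df["NumberOfBathrooms"] = df[bath_cols].fillna(0).sum(axis=1)

print(df)
```

Half baths count as 1 here because the description asks for a plain sum; weighting them as 0.5 would be a different definition.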

Data Preprocessing & Exploration (using train dataset to build model)

1. Load the dataset into a Pandas DataFrame and display its first five rows.
2. Check for missing values and decide how to handle them (drop, fill, or impute).
3. Find duplicate rows and remove them if any exist.
4. Describe the distribution of SalePrice using summary statistics.
5. Calculate the correlation between SalePrice and other numerical features. Which feature
has the highest correlation with price?
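
The five preprocessing steps above could be sketched as follows. This is only an illustration: the small inline DataFrame stands in for the real file (which would normally be loaded with `pd.read_csv`), its values are made up, and filling gaps with the median is just one of the options the task lists.

```python
import numpy as np
import pandas as pd

# Stand-in for the real training data; all values are made up.
df = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 14260, 14260],
    "BedroomAbvGr": [3, 3, 3, 3, 4, 4],
    "YearBuilt": [2003, 1976, 2001, 1915, 2000, 2000],
    "GarageCars": [2, 2, np.nan, 3, 3, 3],
    "Neighborhood": ["CollgCr", "Veenker", "CollgCr", "Crawfor", "NoRidge", "NoRidge"],
    "SalePrice": [208500, 181500, 223500, 140000, 250000, 250000],
})

# 1. First five rows
print(df.head())

# 2. Missing values per column, then one possible fix (median fill)
print(df.isna().sum())
df["GarageCars"] = df["GarageCars"].fillna(df["GarageCars"].median())

# 3. Count exact duplicate rows and drop them
n_dupes = int(df.duplicated().sum())
df = df.drop_duplicates().reset_index(drop=True)

# 4. Summary statistics of the target
print(df["SalePrice"].describe())

# 5. Pearson correlation of each numeric column with SalePrice
corr = df.select_dtypes("number").corr()["SalePrice"].drop("SalePrice")
top_feature = corr.abs().idxmax()
print(corr.sort_values(ascending=False))
print("Most correlated feature:", top_feature)
```

Ranking by absolute correlation means a strongly negative relationship also counts as "highest"; drop `.abs()` if only positive correlation is of interest.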

Feature Engineering (using train dataset to build model)

1. Create a new feature "AgeOfHouse" as (Current Year - YearBuilt).
2. Convert the Neighborhood feature into numerical values using one-hot encoding.
3. Create an interaction feature between SquareFootage and NumberOfBedrooms (BedroomAbvGr) to analyze
their combined effect on price.
4. Bucket houses into price ranges (Low, Medium, High) based on SalePrice percentiles.
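
A minimal sketch of the four feature-engineering steps above. The sample rows are made up, the `SquareFootage` column is assumed to exist under that name, the names `SqFt_x_Bedrooms`, `PriceRange` and the `Nbhd` prefix are illustrative, and splitting at terciles is only one reading of "percentiles".

```python
import pandas as pd

# Made-up rows; "SquareFootage" is kept as named in the task (assumed column).
df = pd.DataFrame({
    "YearBuilt": [2003, 1976, 2001, 1915, 2000, 1993],
    "Neighborhood": ["CollgCr", "Veenker", "CollgCr", "Crawfor", "NoRidge", "Mitchel"],
    "SquareFootage": [1710, 1262, 1786, 1717, 2198, 1362],
    "BedroomAbvGr": [3, 3, 3, 3, 4, 1],
    "SalePrice": [208500, 181500, 223500, 140000, 250000, 143000],
})

# 1. Age relative to the current calendar year
current_year = pd.Timestamp.now().year
df["AgeOfHouse"] = current_year - df["YearBuilt"]

# 3. Interaction term: product of living area and bedroom count
df["SqFt_x_Bedrooms"] = df["SquareFootage"] * df["BedroomAbvGr"]

# 4. Three equal-frequency buckets (terciles) of SalePrice
df["PriceRange"] = pd.qcut(df["SalePrice"], q=3, labels=["Low", "Medium", "High"])

# 2. One-hot encode Neighborhood: one indicator column per area
df = pd.get_dummies(df, columns=["Neighborhood"], prefix="Nbhd")

print(df.head())
```

`pd.qcut` picks the cut points from the data, so each bucket holds roughly the same number of houses; fixed dollar thresholds would need `pd.cut` instead.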

Data Querying & Filtering (using train dataset to build model)

1. Find all houses built before 1980 that have at least 3 bedrooms and 2 bathrooms.
2. Identify the top 5 most expensive neighborhoods based on average SalePrice.
3. List all houses that have a GarageCars value of at least 2 and a Lot Area greater than 5000 sq ft.
4. Find houses where the price per square foot (SalePrice / SquareFootage) is greater than
$200.
5. Identify houses with the smallest and largest lot sizes in the dataset.
6. Find the most common number of bedrooms in houses that sold for more than $500,000.
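
The six queries above map onto boolean masks and a `groupby`. The sketch below uses made-up rows and assumes that the garage and lot filters refer to the GarageCars and LotArea columns, and that `SquareFootage` and `NumberOfBathrooms` exist under those names.

```python
import pandas as pd

# Made-up rows; column names follow the feature list where one exists.
df = pd.DataFrame({
    "LotArea": [8450, 9600, 4500, 9550, 14260, 21535],
    "BedroomAbvGr": [3, 3, 2, 3, 4, 4],
    "NumberOfBathrooms": [3, 2, 1, 2, 3, 4],
    "YearBuilt": [2003, 1976, 2001, 1915, 2000, 1994],
    "Neighborhood": ["CollgCr", "Veenker", "CollgCr", "Crawfor", "NoRidge", "NoRidge"],
    "GarageCars": [2, 2, 1, 3, 3, 3],
    "SquareFootage": [1710, 1262, 900, 1717, 2198, 3100],
    "SalePrice": [208500, 181500, 190000, 140000, 250000, 755000],
})

# 1. Built before 1980 with at least 3 bedrooms and 2 bathrooms
older_family = df[
    (df["YearBuilt"] < 1980)
    & (df["BedroomAbvGr"] >= 3)
    & (df["NumberOfBathrooms"] >= 2)
]

# 2. Top 5 neighborhoods by mean SalePrice
top_neighborhoods = df.groupby("Neighborhood")["SalePrice"].mean().nlargest(5)

# 3. Garage for at least 2 cars and a lot larger than 5000 sq ft
big_garage_lot = df[(df["GarageCars"] >= 2) & (df["LotArea"] > 5000)]

# 4. Price per square foot above $200
price_per_sqft = df["SalePrice"] / df["SquareFootage"]
pricey = df[price_per_sqft > 200]

# 5. Rows with the smallest and the largest lot
smallest_lot = df.loc[df["LotArea"].idxmin()]
largest_lot = df.loc[df["LotArea"].idxmax()]

# 6. Most common bedroom count among sales above $500,000
common_bedrooms = df.loc[df["SalePrice"] > 500_000, "BedroomAbvGr"].mode()

print(older_family, top_neighborhoods, big_garage_lot, pricey, sep="\n\n")
print(smallest_lot["LotArea"], largest_lot["LotArea"], common_bedrooms.tolist())
```

`mode()` returns a Series because several bedroom counts can tie, and it is empty when no sale exceeds the threshold.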

Visualization Tasks (using train dataset to build model)

1. Describe the distribution of SalePrice using summary statistics and visualizations.
2. Check for outliers in the SalePrice column using a box plot and comment on what it shows.
3. Create a histogram to visualize the distribution of SalePrice. Is it normally distributed?
Comment on the result.
4. Visualize the relationship between Square Footage and SalePrice using a scatter plot.
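
One way to approach the four visualization tasks above with Matplotlib is sketched here. The data is synthetic (generated only so the example runs on its own), `SquareFootage` is an assumed column name, and the 1.5 x IQR rule is used because it is the default whisker rule of a Matplotlib box plot.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in data, not real sales.
rng = np.random.default_rng(0)
sqft = rng.uniform(600, 3500, size=300)
price = sqft * 110 * rng.lognormal(mean=0.0, sigma=0.3, size=300)
df = pd.DataFrame({"SquareFootage": sqft, "SalePrice": price})

# 1. Summary statistics plus skewness (near 0 for a symmetric distribution)
print(df["SalePrice"].describe())
skewness = float(df["SalePrice"].skew())

# 2. Points beyond 1.5 * IQR from the quartiles, i.e. outside the whiskers
q1 = df["SalePrice"].quantile(0.25)
q3 = df["SalePrice"].quantile(0.75)
iqr = q3 - q1
is_outlier = (df["SalePrice"] < q1 - 1.5 * iqr) | (df["SalePrice"] > q3 + 1.5 * iqr)
outliers = df[is_outlier]

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

axes[0].boxplot(df["SalePrice"])                 # 2. box plot
axes[0].set_title("SalePrice box plot")

axes[1].hist(df["SalePrice"], bins=30)           # 3. histogram
axes[1].set_title("SalePrice histogram")
axes[1].set_xlabel("SalePrice")

axes[2].scatter(df["SquareFootage"], df["SalePrice"], s=8)  # 4. scatter plot
axes[2].set_title("SquareFootage vs SalePrice")
axes[2].set_xlabel("SquareFootage")
axes[2].set_ylabel("SalePrice")

fig.tight_layout()
print("skewness:", skewness, "outliers:", len(outliers))
```

A clearly positive skewness together with a long right tail in the histogram would suggest that SalePrice is not normally distributed; call `fig.savefig(...)` or `plt.show()` to view the figure.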

Additional Tasks (using train and test datasets to build model)

Machine Learning Model for Price Prediction

1. Split the dataset into training (80%) and testing (20%) sets.
2. Train a Linear Regression model to predict SalePrice using the given features.
3. Evaluate the model using RMSE (Root Mean Squared Error) and R² score.
4. Train a Decision Tree Regressor and compare its performance with Linear Regression.
5. Train a Random Forest Regressor and check if it improves accuracy.
6. Perform hyperparameter tuning for the Random Forest model.
7. Compare feature importances from the Random Forest model to understand which
features influence price the most.
8. Use cross-validation to validate the performance of your models.
9. Make a price prediction for a new house with given features (e.g., 4 bedrooms, 2
bathrooms, 2500 sqft, built in 2005, in a certain neighbourhood).
10. Use K-Means clustering to group similar houses and analyze price variations within
clusters.
11. Perform time-series analysis on housing price trends if the dataset contains historical
sales data.
12. Use PCA (Principal Component Analysis) to reduce dimensionality and improve model
performance.
13. Train a Neural Network model using TensorFlow/Keras for price prediction.
14. Use SHAP (SHapley Additive exPlanations) to explain model predictions.
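
Tasks 1 to 9 could be sketched with scikit-learn as below. The data is synthetic and linear by construction, so the scores say nothing about the real dataset. The feature set, the parameter grid and the example house are assumptions for illustration (the example house omits square footage and neighbourhood because the synthetic data has no such columns).

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data; coefficients and noise level are made up.
rng = np.random.default_rng(42)
n = 400
X = pd.DataFrame({
    "LotArea": rng.uniform(3000, 20000, n),
    "BedroomAbvGr": rng.integers(1, 6, n),
    "NumberOfBathrooms": rng.integers(1, 5, n),
    "YearBuilt": rng.integers(1900, 2011, n),
    "GarageCars": rng.integers(0, 4, n),
})
y = (5 * X["LotArea"] + 12000 * X["BedroomAbvGr"] + 15000 * X["NumberOfBathrooms"]
     + 600 * (X["YearBuilt"] - 1900) + 10000 * X["GarageCars"]
     + rng.normal(0, 15000, n))

# 1. 80/20 split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# 2, 4, 5. The three regressors to compare
models = {
    "LinearRegression": LinearRegression(),
    "DecisionTree": DecisionTreeRegressor(random_state=42),
    "RandomForest": RandomForestRegressor(n_estimators=100, random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "RMSE": float(np.sqrt(mean_squared_error(y_test, pred))),  # 3. RMSE
        "R2": float(r2_score(y_test, pred)),                        # 3. R² score
        # 8. Mean R² over 5 cross-validation folds
        "CV_R2": float(cross_val_score(model, X, y, cv=5, scoring="r2").mean()),
    }
print(pd.DataFrame(results).T)

# 6. Small grid search over two Random Forest hyperparameters
grid = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid={"n_estimators": [50, 100], "max_depth": [None, 8]},
    cv=3,
    scoring="neg_root_mean_squared_error",
)
grid.fit(X_train, y_train)
best_rf = grid.best_estimator_

# 7. Feature importances of the tuned forest (they sum to 1)
importances = pd.Series(best_rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))

# 9. Prediction for one hypothetical house
new_house = pd.DataFrame([{
    "LotArea": 9000, "BedroomAbvGr": 4, "NumberOfBathrooms": 2,
    "YearBuilt": 2005, "GarageCars": 2,
}])
predicted_price = float(best_rf.predict(new_house)[0])
print("Predicted price:", round(predicted_price))
```

On the real data, Neighborhood would need to be one-hot encoded before fitting, and the new house would need the same columns in the same order as the training frame.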

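For tasks 10 and 12, a minimal sketch of K-Means clustering and PCA with scikit-learn follows. The data is synthetic, and the choice of three clusters and two principal components is arbitrary; both would normally be chosen by inspecting the data (for example with an elbow plot or the explained variance).

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; values and the price formula are made up.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "LotArea": rng.uniform(3000, 20000, n),
    "BedroomAbvGr": rng.integers(1, 6, n),
    "YearBuilt": rng.integers(1900, 2011, n),
    "GarageCars": rng.integers(0, 4, n),
})
df["SalePrice"] = (5 * df["LotArea"] + 12000 * df["BedroomAbvGr"]
                   + rng.normal(0, 15000, n))

features = ["LotArea", "BedroomAbvGr", "YearBuilt", "GarageCars"]

# Both methods are scale-sensitive, so standardize the features first.
X_scaled = StandardScaler().fit_transform(df[features])

# 10. Group similar houses, then compare prices within each cluster
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
df["Cluster"] = kmeans.fit_predict(X_scaled)
cluster_prices = df.groupby("Cluster")["SalePrice"].agg(["count", "mean", "std"])
print(cluster_prices)

# 12. Project the features onto the first two principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
explained = pca.explained_variance_ratio_
print("Explained variance ratio:", explained)
```

SalePrice is left out of the clustering features so that the price spread within each cluster is not an artifact of clustering on price. Whether the PCA projection improves a downstream model is not guaranteed and should be checked with cross-validation.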