House Prices Analysis - Final Assessment
The train and test datasets each contain around 1,460 data points. For the ML tasks, merge the two
datasets before you start.
Scenario:
You are working with a real estate dataset that contains various features such as:
Neighborhood: Categorical feature indicating the area where the house is located.
1. Load the dataset into a Pandas DataFrame and display its first five rows.
2. Check for missing values and decide how to handle them (drop, fill, or impute).
3. Find duplicate rows and remove them if any exist.
4. Describe the distribution of SalePrice using summary statistics.
5. Calculate the correlation between SalePrice and other numerical features. Which feature
has the highest correlation with price?
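The five steps above could be sketched roughly as follows. This is a minimal illustration, not a reference solution: the tiny DataFrame is made up so the snippet runs on its own, the Bedrooms column name is an assumption, and filling gaps with the median is only one of the options the task allows.

```python
import numpy as np
import pandas as pd

# Made-up sample standing in for the real file. With the actual data you
# would load it with pd.read_csv(...) instead (the file name is up to you).
df = pd.DataFrame({
    "Neighborhood": ["A", "B", "A", "C", "B", "B"],
    "SquareFootage": [1500, 2000, 1200, np.nan, 1800, 1800],
    "Bedrooms": [3, 4, 2, 5, 3, 3],  # assumed column name
    "SalePrice": [200000, 320000, 150000, 410000, 260000, 260000],
})

# 1. First five rows.
print(df.head())

# 2. Missing values per column, then one possible treatment:
#    fill the numeric gap with the column median.
print(df.isna().sum())
df["SquareFootage"] = df["SquareFootage"].fillna(df["SquareFootage"].median())

# 3. Count duplicate rows and drop them.
n_duplicates = int(df.duplicated().sum())
df = df.drop_duplicates().reset_index(drop=True)

# 4. Summary statistics for the target.
print(df["SalePrice"].describe())

# 5. Correlation of every numeric feature with SalePrice.
corr = df.select_dtypes("number").corr()["SalePrice"].drop("SalePrice")
top_feature = corr.abs().idxmax()
print(corr.sort_values(ascending=False))
print("Most correlated feature:", top_feature)
```

Taking the absolute value before `idxmax` means a strong negative correlation would also count as "highest"; drop `.abs()` if only positive correlation is of interest.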
1. Find all houses built before 1980 that have at least 3 bedrooms and 2 bathrooms.
2. Identify the top 5 most expensive neighborhoods based on average SalePrice.
3. List all houses that have a GarageSize of at least 2 and a LotSize greater than 5000 sqft.
4. Find houses where the price per square foot (SalePrice / SquareFootage) is greater than
$200.
5. Identify houses with the smallest and largest lot sizes in the dataset.
6. Find the most common number of bedrooms in houses that sold for more than $500,000.
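The queries above map fairly directly onto boolean masks and `groupby` in pandas. The sketch below uses a small made-up table so it can run stand-alone; Neighborhood, SalePrice, SquareFootage, LotSize and GarageSize are named in the tasks, while YearBuilt, Bedrooms and Bathrooms are assumed column names that may differ in the real dataset.

```python
import pandas as pd

# Made-up rows for illustration only.
houses = pd.DataFrame({
    "Neighborhood": ["A", "B", "A", "C", "B"],
    "YearBuilt": [1975, 1990, 1960, 2005, 1978],   # assumed column name
    "Bedrooms": [3, 4, 2, 4, 3],                   # assumed column name
    "Bathrooms": [2, 3, 1, 3, 1],                  # assumed column name
    "GarageSize": [2, 2, 1, 3, 0],
    "LotSize": [6000, 4800, 5200, 9000, 7000],
    "SquareFootage": [1500, 2000, 1200, 2500, 1400],
    "SalePrice": [240000, 520000, 150000, 600000, 210000],
})

# 1. Built before 1980 with at least 3 bedrooms and 2 bathrooms.
older_family = houses[
    (houses["YearBuilt"] < 1980)
    & (houses["Bedrooms"] >= 3)
    & (houses["Bathrooms"] >= 2)
]

# 2. Top 5 neighborhoods by average SalePrice.
top_neighborhoods = houses.groupby("Neighborhood")["SalePrice"].mean().nlargest(5)

# 3. GarageSize of at least 2 and LotSize above 5000 sqft.
big_garage_lot = houses[(houses["GarageSize"] >= 2) & (houses["LotSize"] > 5000)]

# 4. Price per square foot above $200.
price_per_sqft = houses["SalePrice"] / houses["SquareFootage"]
pricey = houses[price_per_sqft > 200]

# 5. Houses with the smallest and largest lots.
smallest_lot = houses.loc[houses["LotSize"].idxmin()]
largest_lot = houses.loc[houses["LotSize"].idxmax()]

# 6. Most common bedroom count among houses sold for more than $500,000.
common_bedrooms = houses.loc[houses["SalePrice"] > 500000, "Bedrooms"].mode().iloc[0]
```

Note that `idxmin` and `idxmax` return only the first match; if several houses tie for the smallest or largest lot, filtering on `houses["LotSize"] == houses["LotSize"].min()` would return all of them.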
Visualization Tasks (using train dataset to build model)
1. Split the dataset into training (80%) and testing (20%) sets.
2. Train a Linear Regression model to predict SalePrice using the given features.
3. Evaluate the model using RMSE (Root Mean Squared Error) and R² score.
4. Train a Decision Tree Regressor and compare its performance with Linear Regression.
5. Train a Random Forest Regressor and check if it improves accuracy.
6. Perform hyperparameter tuning for the Random Forest model.
7. Compare feature importances from the Random Forest model to understand which
features influence price the most.
8. Use cross-validation to validate the performance of your models.
9. Make a price prediction for a new house with given features (e.g., 4 bedrooms, 2
bathrooms, 2500 sqft, built in 2005, in a certain neighbourhood).
10. Use K-Means clustering to group similar houses and analyze price variations within
clusters.
11. Perform time-series analysis on housing price trends if the dataset contains historical
sales data.
12. Use PCA (Principal Component Analysis) to reduce dimensionality and improve model
performance.
13. Train a Neural Network model using TensorFlow/Keras for price prediction.
14. Use SHAP (SHapley Additive exPlanations) to explain model predictions.
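As a starting point for steps 1 to 9 of the list above, the core scikit-learn workflow might look something like this. It is a sketch under stated assumptions: the data is synthetic (generated from a made-up price formula) so the code runs without the real files, only four numeric features with assumed names are used, and the hyperparameter grid is deliberately tiny. A categorical feature such as Neighborhood would need encoding first, for example with `pd.get_dummies`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data; column names and the price formula are assumptions.
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "Bedrooms": rng.integers(1, 6, n),
    "Bathrooms": rng.integers(1, 4, n),
    "SquareFootage": rng.integers(800, 4000, n),
    "YearBuilt": rng.integers(1950, 2020, n),
})
y = (100 * X["SquareFootage"] + 10000 * X["Bedrooms"]
     + 500 * (X["YearBuilt"] - 1950) + rng.normal(0, 10000, n))

# Step 1: 80/20 split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Steps 2-5: fit three models and compare RMSE and R² on the test set.
models = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(random_state=0),
    "forest": RandomForestRegressor(n_estimators=50, random_state=0),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
    scores[name] = {"rmse": rmse, "r2": float(r2_score(y_test, pred))}
print(pd.DataFrame(scores).T)

# Step 6: small grid search over Random Forest hyperparameters.
grid = GridSearchCV(
    RandomForestRegressor(random_state=0),
    {"n_estimators": [25, 50], "max_depth": [None, 5]},
    cv=3,
    scoring="neg_root_mean_squared_error",
)
grid.fit(X_train, y_train)
best_forest = grid.best_estimator_

# Step 7: feature importances from the tuned forest.
importances = pd.Series(
    best_forest.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)

# Step 8: 5-fold cross-validated R² for the linear model.
cv_r2 = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")

# Step 9: predict the price of a new house (numeric features only here).
new_house = pd.DataFrame(
    [{"Bedrooms": 4, "Bathrooms": 2, "SquareFootage": 2500, "YearBuilt": 2005}])
predicted_price = float(best_forest.predict(new_house)[0])
print(f"Predicted price: {predicted_price:,.0f}")
```

RMSE is computed as the square root of `mean_squared_error` so the snippet does not depend on version-specific RMSE helpers. The remaining steps (K-Means, time series, PCA, a neural network, SHAP) build on the same train/test split but are not covered here.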