Regression Dataset
Link: https://www.kaggle.com/datasets/camnugent/california-housing-prices/data
Features:
1. Longitude: A measure of how far west a house is; a higher value is farther west
2. Latitude: A measure of how far north a house is; a higher value is farther north
3. HousingMedianAge: Median age of a house within a block; a lower number is a newer building
4. TotalRooms: Total number of rooms within a block
5. TotalBedrooms: Total number of bedrooms within a block
6. Population: Total number of people residing within a block
7. Households: Total number of households, a group of people residing within a home unit, for a block
8. MedianIncome: Median income for households within a block of houses (measured in tens of
thousands of US Dollars)
9. MedianHouseValue: Median house value for households within a block (measured in US Dollars)
Missing Values: There are 207 missing values in total_bedrooms, which is about 1% of the total dataset.
The relatively small proportion of missing values (1%) means that their impact on the analysis is minimal,
but they still need to be addressed to ensure data integrity.
Problem: The problem type is regression because the goal is to predict a continuous target variable, in
this case the median house value. Regression analysis applies when the outcome or dependent variable is
continuous and we want to understand its relationship with one or more independent variables.
Objective: The objective is to predict future house prices based on various features such as location, size,
age, and socio-economic factors. This involves estimating the price as a function of these variables.
- Applications:
- Real Estate Market Analysis: Predicting house prices for investment and market trend analysis.
- Urban Planning: Assisting city planners in understanding the distribution of housing prices.
- Financial Services: Helping mortgage lenders in risk assessment and pricing.
- Imputation: The median value of total_bedrooms is used to fill in the 207 missing values. The
median is chosen because it is robust to outliers and will not skew the data distribution. Given
that these missing values constitute only about 1% of the total dataset, their imputation is
unlikely to significantly affect the overall data quality.
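A minimal sketch of the median imputation step with pandas; the tiny frame below stands in for the housing table (in practice the column comes from the Kaggle CSV), and the NaNs mimic the 207 missing entries:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the housing data; NaNs mimic the missing entries.
df = pd.DataFrame({"total_bedrooms": [129.0, np.nan, 190.0, 235.0, np.nan]})

# Median imputation: robust to outliers, so it does not skew the distribution.
median_bedrooms = df["total_bedrooms"].median()
df["total_bedrooms"] = df["total_bedrooms"].fillna(median_bedrooms)
```

On the full dataset the same two lines fill all 207 gaps in total_bedrooms.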
Normalization:
Outlier Detection:
- Method: Z-score or IQR (Interquartile Range) method to detect and possibly remove outliers.
- Justification: The Z-score method standardizes each value by subtracting the mean and dividing by the
standard deviation, then flags points whose absolute Z-score exceeds a threshold (e.g., 3). Outliers can
skew the results of regression models, so removing them yields a more robust fit.
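The two detection methods can be sketched as follows on a stand-in column (the data and the 1.5×IQR fence are illustrative; note that on a sample this small no Z-score can reach 3, so only the IQR filter drops the extreme value here):

```python
import pandas as pd

df = pd.DataFrame({"median_income": [2.5, 3.1, 2.9, 3.4, 15.0, 3.0]})
col = df["median_income"]

# Z-score method: standardize, then keep rows with |z| below a threshold.
z = (col - col.mean()) / col.std()
df_z = df[z.abs() <= 3]

# IQR method: keep rows within [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = col.quantile(0.25), col.quantile(0.75)
iqr = q3 - q1
df_iqr = df[(col >= q1 - 1.5 * iqr) & (col <= q3 + 1.5 * iqr)]
```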
Feature Engineering:
Linear Regression:
- Justification: Provides a baseline model and is interpretable. It helps in understanding the linear
relationship between the features and the target variable.
Gradient Boosting:
- Justification: Effective for capturing non-linear relationships and interactions between features. It builds
an ensemble of weak learners to create a strong predictive model.
Random Forest:
- Justification: Robust to overfitting due to its ensemble nature and can handle a large number of
features. It also provides feature importance which helps in understanding the impact of each feature.
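A hedged sketch of fitting the candidate models with scikit-learn; the synthetic arrays stand in for the housing features and target, and all hyperparameters shown are assumptions rather than tuned choices:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                              # stand-in features
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

models = {
    "linear": LinearRegression(),
    "gbm": GradientBoostingRegressor(random_state=0),
    "rf": RandomForestRegressor(n_estimators=100, random_state=0),
}
scores = {name: m.fit(X, y).score(X, y) for name, m in models.items()}

# Random forests expose per-feature importances for interpretation.
importances = models["rf"].feature_importances_
```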
Data Splitting:
Cross-Validation:
- Method: 10-fold cross-validation. This involves dividing the training data into 10 subsets, training
the model on 9 subsets, and validating it on the remaining subset. This process is repeated 10
times, each time with a different subset as the validation set.
- Justification: Cross-validation helps in assessing the model's performance and ensuring that it
generalizes well to unseen data.
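The splitting and 10-fold procedure above can be sketched as follows; the 80/20 hold-out ratio is an assumption (the text does not specify one), and synthetic data stands in for the housing table:

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.1, size=300)

# Hold out a test set first (80/20 is an assumed ratio).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 10-fold CV on the training data: 9 folds fit, 1 fold validates, rotated 10x.
scores = cross_val_score(
    LinearRegression(), X_train, y_train, cv=10, scoring="neg_mean_squared_error"
)
mean_cv_mse = -scores.mean()
```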
Evaluation Metrics:
- Mean Squared Error (MSE): Measures the average squared difference between predicted and
actual values.
- Root Mean Squared Error (RMSE): The square root of MSE, providing a measure in the same units
as the target variable.
- R-squared: Indicates the proportion of the variance in the dependent variable that is predictable
from the independent variables.
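The three metrics can be computed with scikit-learn as below; the predicted and actual values are made-up illustrations, not model output:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([200000.0, 150000.0, 320000.0, 275000.0])  # illustrative USD values
y_pred = np.array([210000.0, 140000.0, 300000.0, 280000.0])

mse = mean_squared_error(y_true, y_pred)   # average squared error
rmse = np.sqrt(mse)                        # same units as the target (USD)
r2 = r2_score(y_true, y_pred)              # fraction of variance explained
```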