
Regression Dataset: California Housing Prices

1. Description of the Dataset

Data Source: Kaggle

Link: https://www.kaggle.com/datasets/camnugent/california-housing-prices/data

Features:

1. Longitude: A measure of how far west a house is; a higher value is farther west

2. Latitude: A measure of how far north a house is; a higher value is farther north

3. HousingMedianAge: Median age of a house within a block; a lower number is a newer building

4. Total Rooms: Total number of rooms within a block

5. Total Bedrooms: Total number of bedrooms within a block

6. Population: Total number of people residing within a block

7. Households: Total number of households, a group of people residing within a home unit, for a block

8. MedianIncome: Median income for households within a block of houses (measured in tens of
thousands of US Dollars)

9. MedianHouseValue: Median house value for households within a block (measured in US Dollars)

Missing Values: There are 207 missing values in total_bedrooms, which is about 1% of the total dataset.
The relatively small proportion of missing values (1%) means that their impact on the analysis is minimal,
but they still need to be addressed to ensure data integrity.
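A minimal sketch for verifying this count, assuming the Kaggle CSV has been downloaded locally as housing.csv and uses the raw column names (e.g., total_bedrooms):

```python
import pandas as pd

# Load the Kaggle CSV (assumed to be saved locally as "housing.csv")
housing = pd.read_csv("housing.csv")

# Count missing values per column; total_bedrooms is expected to show 207
print(housing.isnull().sum())

# Missing values as a fraction of all rows (~1% for total_bedrooms)
print(housing.isnull().mean())
```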

2. Problem Type and Applications

- Problem Type: Regression

Problem: The problem type is regression because the goal is to predict a continuous target variable, in this case the median house value. Regression analysis is used when the dependent variable is continuous and we want to understand its relationship with one or more independent variables.

Objective: The objective is to predict future house prices based on various features such as location, size,
age, and socio-economic factors. This involves estimating the price as a function of these variables.

- Applications:

- Real Estate Market Analysis: Predicting house prices for investment and market trend analysis.
- Urban Planning: Assisting city planners in understanding the distribution of housing prices.
- Financial Services: Helping mortgage lenders in risk assessment and pricing.

3. Data Preprocessing Steps


- Handling Missing Values
- Normalization
- Outlier Detection
- Feature Engineering

Handling Missing Values:

- Imputation: The median value of total_bedrooms is used to fill in the 207 missing values. The
median is chosen because it is robust to outliers and will not skew the data distribution. Given
that these missing values constitute only about 1% of the total dataset, their imputation is
unlikely to significantly affect the overall data quality.
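A minimal sketch of this imputation step, assuming the data is loaded into a pandas DataFrame named housing with the raw Kaggle column name total_bedrooms:

```python
# Fill the 207 missing total_bedrooms values with the column median
median_bedrooms = housing["total_bedrooms"].median()
housing["total_bedrooms"] = housing["total_bedrooms"].fillna(median_bedrooms)
```

In a full pipeline the median would normally be computed on the training split only, so that no information leaks from the validation and test sets.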

Normalization:

- Technique: Min-Max scaling to transform features to the range [0, 1].
- Justification: Normalization puts all features on a comparable scale, which matters for scale-sensitive methods such as regularized linear models and distance-based algorithms. Tree-based ensembles like Gradient Boosting and Random Forest are largely insensitive to feature scales, but scaling keeps the pipeline consistent across all models considered here.
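A minimal sketch of Min-Max scaling with scikit-learn, applied to the numeric feature columns only (the target median_house_value is excluded):

```python
from sklearn.preprocessing import MinMaxScaler

# Scale all numeric feature columns to [0, 1]; the target is left untouched
feature_cols = housing.select_dtypes("number").columns.drop("median_house_value")
scaler = MinMaxScaler()
housing[feature_cols] = scaler.fit_transform(housing[feature_cols])
```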

Outlier Detection:

- Method: Z-score or IQR (Interquartile Range) method to detect and possibly remove outliers, as sketched below.
- Justification: The Z-score method standardizes each feature by subtracting the mean and dividing by the standard deviation, and flags points whose absolute Z-score exceeds a threshold (e.g., 3) as outliers; the IQR method flags values outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR]. Outliers can skew the estimates of regression models, so removing or capping them yields a more robust fit.
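A minimal sketch of both detection methods, assuming the housing DataFrame from the earlier sketches:

```python
numeric = housing.select_dtypes("number")

# Z-score method: flag rows where any feature lies more than 3 standard
# deviations from its mean
z_scores = (numeric - numeric.mean()) / numeric.std()
z_mask = (z_scores.abs() > 3).any(axis=1)

# IQR method: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
iqr = q3 - q1
iqr_mask = ((numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)).any(axis=1)

# Keep only the non-outlier rows (choose one of the two masks)
housing_clean = housing[~z_mask]
```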

Feature Engineering:

- Creating New Features: For example, `rooms_per_household`, `bedrooms_per_room`, and `population_per_household` (see the sketch after this list).
- Justification: Derived ratio features such as these can better capture the relationship between the existing features and the target variable. For example, rooms_per_household reflects the average size of houses in a block, which is likely to be a strong predictor of house prices.
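A minimal sketch of these derived features, assuming the raw Kaggle column names:

```python
# Derived ratio features described above
housing["rooms_per_household"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["population_per_household"] = housing["population"] / housing["households"]
```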

4. Machine Learning Techniques

Linear Regression:
- Justification: Provides a baseline model and is interpretable. It helps in understanding the linear
relationship between the features and the target variable.

Gradient Boosting Regression:

- Justification: Effective for capturing non-linear relationships and interactions between features. It builds
an ensemble of weak learners to create a strong predictive model.

Random Forest Regression:

- Justification: Robust to overfitting due to its ensemble nature and can handle a large number of
features. It also provides feature importance which helps in understanding the impact of each feature.
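A minimal sketch of instantiating these three models with scikit-learn; the hyperparameters shown are illustrative defaults, not tuned values:

```python
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

# Candidate models; hyperparameters are placeholders to be tuned on the validation set
models = {
    "linear_regression": LinearRegression(),
    "gradient_boosting": GradientBoostingRegressor(n_estimators=200, random_state=42),
    "random_forest": RandomForestRegressor(n_estimators=200, random_state=42),
}
```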

5. Training, Validation, and Testing

Data Splitting:

- Training Set: 70% of the data.
- Validation Set: 15% of the data, used for hyperparameter tuning.
- Test Set: 15% of the data, used for final evaluation.
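A minimal sketch of the 70/15/15 split, assuming scikit-learn's train_test_split and the raw Kaggle column name median_house_value for the target:

```python
from sklearn.model_selection import train_test_split

X = housing.drop(columns=["median_house_value"])
y = housing["median_house_value"]

# 70% train; the remaining 30% is split in half into 15% validation / 15% test
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.30, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, random_state=42)
```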

Cross-Validation:

- Method: 10-fold cross-validation. This involves dividing the training data into 10 subsets, training
the model on 9 subsets, and validating it on the remaining subset. This process is repeated 10
times, each time with a different subset as the validation set.
- Justification: Cross-validation helps in assessing the model's performance and ensuring that it
generalizes well to unseen data.
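A minimal sketch of 10-fold cross-validation on the training set, illustrated here with the Random Forest model and RMSE derived from scikit-learn's negative-MSE scoring convention:

```python
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor

# 10-fold CV on the training split, scored with negative MSE
model = RandomForestRegressor(n_estimators=200, random_state=42)
neg_mse = cross_val_score(model, X_train, y_train, cv=10,
                          scoring="neg_mean_squared_error")
rmse_per_fold = (-neg_mse) ** 0.5
print(rmse_per_fold.mean(), rmse_per_fold.std())
```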

Evaluation Metrics:

- Mean Squared Error (MSE): Measures the average squared difference between predicted and
actual values.
- Root Mean Squared Error (RMSE): The square root of MSE, providing a measure in the same units
as the target variable.
- R-squared: Indicates the proportion of the variance in the dependent variable that is predictable
from the independent variables.
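A minimal sketch of computing these metrics on the held-out test set with scikit-learn, using the model and splits defined in the earlier sketches:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Fit on the training set and evaluate on the held-out test set
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print(f"MSE = {mse:,.0f}  RMSE = {rmse:,.0f}  R^2 = {r2:.3f}")
```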
