Assignment1

The document outlines a machine learning workflow for predicting housing prices from a dataset with 10 columns: nine predictors and the target, median_house_value. Preprocessing covers median imputation of missing values, one-hot encoding of the categorical variable, feature engineering, dropping weakly correlated features, and IQR-based outlier removal, after which the data is split into training and test sets. Finally, the features are standardized and expanded to degree-2 polynomial terms, a linear regression model is trained on them, and the mean squared error is computed for both the training and test sets.

Uploaded by

Rishabh Awasthi
Copyright
© All Rights Reserved

20/02/2025, 12:12 Untitled1.ipynb - Colab

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler, PolynomialFeatures
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Load the dataset


file_path = "housing.csv" # Update this if needed
df = pd.read_csv(file_path)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 longitude 20640 non-null float64
1 latitude 20640 non-null float64
2 housing_median_age 20640 non-null float64
3 total_rooms 20640 non-null float64
4 total_bedrooms 20433 non-null float64
5 population 20640 non-null float64
6 households 20640 non-null float64
7 median_income 20640 non-null float64
8 median_house_value 20640 non-null float64
9 ocean_proximity 20640 non-null object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB

# Handle missing values in 'total_bedrooms' using median imputation


imputer = SimpleImputer(strategy="median")
df["total_bedrooms"] = imputer.fit_transform(df[["total_bedrooms"]])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 longitude 20640 non-null float64
1 latitude 20640 non-null float64
2 housing_median_age 20640 non-null float64
3 total_rooms 20640 non-null float64
4 total_bedrooms 20640 non-null float64
5 population 20640 non-null float64
6 households 20640 non-null float64
7 median_income 20640 non-null float64
8 median_house_value 20640 non-null float64
9 ocean_proximity 20640 non-null object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
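
To make the imputation step concrete, the sketch below applies the same median strategy to a small made-up column using pandas `fillna`. For a single numeric column this should give the same fill value as `SimpleImputer(strategy="median")`. The toy values and the name `fill_value` are assumptions for illustration and are not taken from housing.csv.

```python
import numpy as np
import pandas as pd

# Toy column with one missing entry (illustrative values only)
toy = pd.DataFrame({"total_bedrooms": [100.0, 300.0, np.nan, 200.0]})

# The median skips NaN by default, so it is computed from 100, 200 and 300
fill_value = toy["total_bedrooms"].median()

# Replace the missing entry with that median
toy["total_bedrooms"] = toy["total_bedrooms"].fillna(fill_value)
```

In the notebook the imputer is fitted on the full dataset before the train/test split. Fitting it on the training rows only would keep test-set statistics out of preprocessing, although with a single median the difference is likely small.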

# One-hot encode 'ocean_proximity'


encoder = OneHotEncoder(sparse_output=False, drop="first") # sparse_output replaces the older `sparse` argument (scikit-learn >= 1.2)
encoded_ocean_proximity = encoder.fit_transform(df[["ocean_proximity"]])

# Convert encoded categories to a DataFrame


encoded_df = pd.DataFrame(encoded_ocean_proximity, columns=encoder.get_feature_names_out())

# Combine numerical and categorical features


df_final = pd.concat([df.drop(columns=["ocean_proximity"]), encoded_df], axis=1)

# Feature Engineering: Creating new features


df_final["rooms_per_household"] = df_final["total_rooms"] / df_final["households"]
df_final["bedrooms_per_room"] = df_final["total_bedrooms"] / df_final["total_rooms"]
df_final["population_per_household"] = df_final["population"] / df_final["households"]

# Feature Selection: Drop weakly correlated features


correlation_matrix = df_final.corr()
correlations = correlation_matrix["median_house_value"].sort_values(ascending=False)
low_corr_features = correlations[abs(correlations) < 0.1].index.tolist()
df_optimized = df_final.drop(columns=low_corr_features)
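
The correlation filter can be checked on a small frame where the answer is known in advance. In this sketch the `noise` column is constructed to have zero linear correlation with the target, so it should be the only one dropped. All values and the names `target_corr`, `weak` and `toy_selected` are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "median_house_value": [2.0, 4.0, 6.0, 8.0],
    "median_income": [1.0, 2.0, 3.0, 4.0],  # perfectly correlated with the target
    "noise": [1.0, -1.0, -1.0, 1.0],        # built to have zero correlation
})

# Pearson correlation of every column with the target
target_corr = toy.corr()["median_house_value"]

# Columns whose absolute correlation falls below the 0.1 threshold
weak = target_corr[target_corr.abs() < 0.1].index.tolist()

toy_selected = toy.drop(columns=weak)
```

A threshold on linear correlation is a blunt tool: it can discard a feature that matters only in combination with another. In the notebook, for instance, longitude is dropped while latitude is kept, as the column listing printed after the split shows.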

# Remove outliers: Keep only data within 1.5 * IQR range


Q1 = df_optimized.quantile(0.25)
Q3 = df_optimized.quantile(0.75)
IQR = Q3 - Q1
df_filtered = df_optimized[~((df_optimized < (Q1 - 1.5 * IQR)) | (df_optimized > (Q3 + 1.5 * IQR))).any(axis=1)]
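
The same IQR rule, applied to a five-row toy frame, shows how a single extreme value removes a whole row. The data and the names `outside` and `toy_filtered` are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 100.0],     # 100 is an obvious outlier
    "b": [10.0, 11.0, 12.0, 13.0, 14.0],  # no outliers
})

q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1

# True wherever a value lies outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for its column
outside = (toy < (q1 - 1.5 * iqr)) | (toy > (q3 + 1.5 * iqr))

# Drop every row that has at least one flagged value
toy_filtered = toy[~outside.any(axis=1)]
```

Because the notebook applies the rule to every column, including the one-hot columns and the target, it is fairly aggressive: the split output shows 10588 + 2647 = 13235 rows remaining out of 20640. For a 0/1 column where more than 75% of the values are 0, the IQR is 0, so every row with a 1 may be flagged. Restricting the filter to continuous columns is one possible alternative.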

# Separate features and target variable


X_opt = df_filtered.drop(columns=["median_house_value"])
y_opt = df_filtered["median_house_value"]

# Split into training (80%) and test (20%) sets


X_train_opt, X_test_opt, y_train_opt, y_test_opt = train_test_split(X_opt, y_opt, test_size=0.2)
X_train_opt.info(),X_test_opt.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10588 entries, 13355 to 3065
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 latitude 10588 non-null float64
1 housing_median_age 10588 non-null float64
2 total_rooms 10588 non-null float64
3 median_income 10588 non-null float64
4 ocean_proximity_INLAND 10588 non-null float64
5 ocean_proximity_NEAR BAY 10588 non-null float64
6 ocean_proximity_NEAR OCEAN 10588 non-null float64
7 rooms_per_household 10588 non-null float64
8 bedrooms_per_room 10588 non-null float64
dtypes: float64(9)
memory usage: 827.2 KB
<class 'pandas.core.frame.DataFrame'>
Index: 2647 entries, 7842 to 2441
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 latitude 2647 non-null float64
1 housing_median_age 2647 non-null float64
2 total_rooms 2647 non-null float64
3 median_income 2647 non-null float64
4 ocean_proximity_INLAND 2647 non-null float64
5 ocean_proximity_NEAR BAY 2647 non-null float64
6 ocean_proximity_NEAR OCEAN 2647 non-null float64
7 rooms_per_household 2647 non-null float64
8 bedrooms_per_room 2647 non-null float64
dtypes: float64(9)
memory usage: 206.8 KB
(None, None)
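
The split above does not set `random_state`, so the rows in each set (and the reported errors) will differ from run to run. A minimal sketch of a reproducible split follows. The seed 42 and the toy arrays are arbitrary assumptions, not values from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each (illustrative only)
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# The same seed gives the same partition on every call
split_a = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
split_b = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

X_train_a, X_test_a, y_train_a, y_test_a = split_a
X_train_b, X_test_b, y_train_b, y_test_b = split_b
```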

# Standardize features (fit on the training split only; the one-hot columns are rescaled as well)


scaler = StandardScaler()
X_train_scaled_opt = scaler.fit_transform(X_train_opt)
X_test_scaled_opt = scaler.transform(X_test_opt)
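
What `fit_transform` on the training set followed by `transform` on the test set amounts to can be written out by hand. This NumPy sketch uses made-up one-column arrays, and the variable names are assumptions. It relies on the fact that `StandardScaler` uses the population standard deviation (`ddof=0`).

```python
import numpy as np

train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

# Statistics come from the training data only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0)

# The same training statistics are applied to both sets
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std
```

The scaled training column has mean 0 and standard deviation 1 by construction. The test column generally does not, which is expected: it is measured against the training distribution.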

# Apply Polynomial Features (degree=2)


poly = PolynomialFeatures(degree=2, include_bias=False)
X_train_poly = poly.fit_transform(X_train_scaled_opt)
X_test_poly = poly.transform(X_test_scaled_opt)
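
It helps to know how many columns the degree-2 expansion produces. The sketch below counts the monomials for the 9 input features listed in the split output, once by enumeration and once with the closed-form binomial expression. The names `n_terms` and `closed_form` are assumptions for illustration.

```python
from itertools import combinations_with_replacement
from math import comb

n_features = 9  # columns listed in the X_train_opt.info() output
degree = 2

# Count every monomial of total degree 1..degree (no bias column)
n_terms = sum(
    1
    for d in range(1, degree + 1)
    for _ in combinations_with_replacement(range(n_features), d)
)

# C(n + degree, degree) counts all monomials including the constant term
closed_form = comb(n_features + degree, degree) - 1
```

Both counts come to 54: 9 linear terms, 9 squares and 36 pairwise products. The count grows quickly with the degree, which is worth keeping in mind before raising it.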

# Train the Linear Regression model on polynomial features


model = LinearRegression()
model.fit(X_train_poly, y_train_opt)

# Make predictions
y_train_poly_pred = model.predict(X_train_poly)
y_test_poly_pred = model.predict(X_test_poly)

# Compute Mean Squared Error (MSE)


train_mse_poly = mean_squared_error(y_train_opt, y_train_poly_pred)
test_mse_poly = mean_squared_error(y_test_opt, y_test_poly_pred)

# Print results
print(f"Training MSE: {train_mse_poly:.2f}")
print(f"Test MSE: {test_mse_poly:.2f}")

Training MSE: 2872509441.01
Test MSE: 2915943708.52
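
MSE values of this size are hard to read because they are in squared units. Taking the square root gives the RMSE in the units of median_house_value (US dollars in the standard California housing data, assuming that is the file used here). The sketch below converts the two figures printed above; the variable names are assumptions.

```python
import math

# Values copied from the printed output above
train_mse = 2872509441.01
test_mse = 2915943708.52

# RMSE is in the same units as the target
train_rmse = math.sqrt(train_mse)
test_rmse = math.sqrt(test_mse)

# Relative gap between test and training error
gap = (test_mse - train_mse) / train_mse

print(f"Training RMSE: {train_rmse:,.0f}")
print(f"Test RMSE: {test_rmse:,.0f}")
print(f"Relative MSE gap: {gap:.1%}")
```

Both RMSE values are close to 54,000, and the test error exceeds the training error by under 2%, which suggests little overfitting on this particular split. Since the split is unseeded, the exact figures will vary between runs.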
