
SNT 7

The document outlines a linear regression analysis assignment using the Boston Housing Dataset to predict house prices based on independent variables such as the number of rooms, lower status population percentage, and pupil-teacher ratio. It details the steps for data preprocessing, model training using Ordinary Least Squares (OLS) regression, and performance evaluation through metrics like Mean Squared Error (MSE) and R² score. Additionally, it includes code snippets for data handling, exploratory data analysis, and interpretation of the regression results, along with limitations of the model.

Uploaded by

nisargbhatt.n

MA262-SNT D24IT179

Week 7: Linear Regression Analysis

Regression analysis is widely used in industries such as retail, meteorology, and real
estate to predict future trends. In this assignment, you will:

Select a dataset suitable for regression analysis.

Preprocess the dataset for missing values and anomalies.

Apply OLS regression to model the relationship between independent and dependent variables.

Evaluate model performance using a suitable metric.

Follow these steps:

Load the dataset using Pandas.

Handle missing values and perform data cleaning.

Select relevant independent variables (e.g., advertising budget, location, product category for sales; humidity, altitude for temperature; house size, number of rooms for housing prices).

Plot histograms, scatter plots, and correlation heatmaps to identify relationships between variables.

Normalize or transform data if necessary.

Use OLS from statsmodels to train the regression model.

Split data into training and testing sets.

Fit the model and interpret coefficients.

Use MSE and R² score for model evaluation.

Identify key predictors and their impact on the dependent variable.

Analyze residuals to check for model validity.
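The residual check in the last step can be sketched as follows. This is a minimal illustration on synthetic data, not the assignment dataset; the variable names (`resid_mean`, `resid_fitted_corr`, `spread_ratio`) and the half-split variance comparison are assumptions chosen for the example, and NumPy's least-squares solver stands in for a full regression library.

```python
import numpy as np

# Synthetic data for illustration only (not the assignment dataset).
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 10, size=n)
y = 2.0 + 3.0 * x + rng.normal(0, 1.5, size=n)

# Design matrix with an explicit intercept column, then a least-squares fit.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

fitted = X @ beta
residuals = y - fitted

# With an intercept, in-sample OLS residuals average to zero and are
# uncorrelated with the fitted values by construction.
resid_mean = residuals.mean()
resid_fitted_corr = np.corrcoef(fitted, residuals)[0, 1]

# A crude spread check: compare residual variance in the lower and upper
# halves of the fitted values (a ratio far from 1 hints at non-constant variance).
order = np.argsort(fitted)
low, high = residuals[order[: n // 2]], residuals[order[n // 2 :]]
spread_ratio = high.var() / low.var()

print(f"mean residual: {resid_mean:.2e}")
print(f"corr(fitted, residuals): {resid_fitted_corr:.2e}")
print(f"variance ratio (upper/lower half): {spread_ratio:.2f}")
```

The zero-mean and zero-correlation properties hold only on the data used for fitting. On a held-out test set the residuals need not average to zero, so a residuals-versus-predicted plot is the more informative check there.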

Submit the following:

Problem definition and dataset details.

Data Preprocessing

CSPIT-IT 1

OLS regression implementation and performance analysis.

Model interpretation and limitations.


Jupyter Notebook with well-commented code.


Problem Definition and Dataset Details


• Select a dataset relevant to regression analysis.
• Preprocess the dataset to handle missing values and anomalies.
• Use Ordinary Least Squares (OLS) regression to find relationships between variables.
• Evaluate the model using Mean Squared Error (MSE) and R² score.
• Interpret the regression coefficients and model performance.
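The two evaluation metrics named above can be computed directly from their definitions. The sketch below is illustrative: the helper names `mse` and `r2` and the four-point example are assumptions made for this example, and the formulas follow the standard definitions (mean squared error, and one minus the ratio of residual to total sum of squares).

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared prediction errors."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    """1 - SS_res / SS_tot: share of variance in y_true explained by y_pred."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return float(1.0 - ss_res / ss_tot)

# Tiny made-up example: two predictions are off by 0.5, two are exact.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 9.0]

print("MSE:", mse(y_true, y_pred))
print("R² :", r2(y_true, y_pred))
```

MSE is expressed in squared units of the target, so for prices in $1000s its square root is easier to read as a typical error size. R² is unit-free and can be negative on test data when the model predicts worse than the mean.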

Dataset: Boston Housing Dataset (Predicting house prices)


Each row of this dataset describes a census tract in the Boston area, including:
• Independent Variables (X):
  ◦ RM (Average number of rooms per dwelling)
  ◦ LSTAT (Percentage of lower status population)
  ◦ PTRATIO (Pupil-teacher ratio in schools)
• Dependent Variable (y):
  ◦ MEDV (Median value of owner-occupied homes in $1000s)
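With these three predictors and one target, the model being fitted can be written out explicitly. The symbols below (the coefficients \beta_0 to \beta_3, the error term \varepsilon_i, and the row index i) are notation introduced here for illustration and do not appear in the assignment text.

```latex
\mathrm{MEDV}_i = \beta_0
  + \beta_1\,\mathrm{RM}_i
  + \beta_2\,\mathrm{LSTAT}_i
  + \beta_3\,\mathrm{PTRATIO}_i
  + \varepsilon_i

\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n}
  \left( \mathrm{MEDV}_i - \beta_0 - \beta_1\,\mathrm{RM}_i
         - \beta_2\,\mathrm{LSTAT}_i - \beta_3\,\mathrm{PTRATIO}_i \right)^2
```

Ordinary Least Squares chooses the coefficients that minimize the sum of squared residuals, which is why the intercept \beta_0 has to be supplied as an explicit constant column when the design matrix is built by hand.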


Code : -
# 📌 Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Ensures plots are displayed in Jupyter Notebook
# (comment kept off the magic line so it is not parsed as an argument)
%matplotlib inline

# 📌 Load the dataset (Boston Housing Data)
url = "https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv"
df = pd.read_csv(url)

# 📌 Display first 5 rows of the dataset
print("🔹 Initial 5 rows of the dataset:")
print(df.head())

# 📌 Check for missing values
print("\n🔹 Missing values in the dataset:\n", df.isnull().sum())

# 📌 If any missing values exist, fill them with the column mean
# (numeric_only=True skips any non-numeric columns instead of raising)
df = df.fillna(df.mean(numeric_only=True))

# 📌 Define independent (X) and dependent (y) variables
X = df[['rm', 'lstat', 'ptratio']]  # Selecting important independent variables
y = df['medv']  # House price (dependent variable)

# 📌 Display dataset statistics
print("\n🔹 Summary statistics of dataset:")
print(df.describe())

# 📌 Exploratory Data Analysis (EDA)

# 🔹 Histogram of House Prices
plt.figure(figsize=(6,4))
sns.histplot(df['medv'], bins=20, kde=True, color="blue")
plt.title("Distribution of House Prices")
plt.xlabel("Median House Price ($1000s)")
plt.ylabel("Frequency")
plt.show()

# 🔹 Scatter plot of RM (rooms) vs. MEDV (price)
plt.figure(figsize=(6,4))
sns.scatterplot(x=df['rm'], y=df['medv'], color='red')
plt.title("Number of Rooms vs House Price")
plt.xlabel("Number of Rooms")
plt.ylabel("House Price ($1000s)")
plt.show()

# 🔹 Correlation Heatmap
plt.figure(figsize=(8,6))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Heatmap of Variables")
plt.show()

# 📌 Splitting Data into Training and Testing Sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 📌 Add constant to independent variables for OLS regression
X_train = sm.add_constant(X_train)
X_test = sm.add_constant(X_test)

# 📌 Print dataset shapes to confirm splitting
print("\n🔹 Training Data Shape (X_train):", X_train.shape)
print("🔹 Testing Data Shape (X_test):", X_test.shape)
print("🔹 Training Target Shape (y_train):", y_train.shape)
print("🔹 Testing Target Shape (y_test):", y_test.shape)

# 📌 Print first few rows to confirm constant addition
print("\n🔹 First 5 rows of training data (X_train):")
print(X_train.head())

# 📌 Train the regression model using statsmodels
model = sm.OLS(y_train, X_train).fit()

# 📌 Display the model summary
print("\n🔹 OLS Regression Model Summary:")
print(model.summary())

# 📌 Predicting on Test Data
y_pred = model.predict(X_test)

# 📌 Compute Mean Squared Error (MSE) and R² Score
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("\n🔹 Model Performance:")
print(f"✅ Mean Squared Error (MSE): {mse:.2f}")
print(f"✅ R² Score: {r2:.2f}")

# 📌 Residual Analysis - Plot Residuals
residuals = y_test - y_pred

plt.figure(figsize=(6,4))
sns.scatterplot(x=y_pred, y=residuals, color='purple')
plt.axhline(y=0, color='red', linestyle='--')
plt.title("Residuals vs Predicted Values")
plt.xlabel("Predicted Values")
plt.ylabel("Residuals")
plt.show()

# 📌 Interpretation of Results
interpretation = """
🔹 **Interpretation of Coefficients:**
- `RM (Rooms)` → **Positive Coefficient**: More rooms per dwelling are associated with
higher house prices, holding the other predictors fixed.
- `LSTAT (Lower Status %)` → **Negative Coefficient**: A higher % of lower-status
population is associated with lower house prices.
- `PTRATIO (Pupil-Teacher Ratio)` → **Negative Coefficient**: A higher pupil-teacher
ratio is associated with lower house prices.

🔹 **Limitations:**
1. The model assumes **linear relationships**, but real estate pricing may have
non-linear effects.
2. With only three predictors, it does not account for **location**, crime rates, or
other neighborhood factors.
3. The dataset may contain **outliers** affecting predictions.
"""

print("\n🔹 Interpretation of Regression Model:")
print(interpretation)
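As a cross-check on what the constant column and the OLS fit in the code above are doing, the same estimate can be reproduced with plain linear algebra. This is a sketch on synthetic data: the coefficient values in `true_beta` are made up for illustration and are not results from the Boston dataset, and NumPy replaces statsmodels here only to expose the underlying computation.

```python
import numpy as np

# Synthetic stand-ins for three predictors and a target (illustration only).
rng = np.random.default_rng(42)
n = 300
X_raw = rng.normal(size=(n, 3))
true_beta = np.array([20.0, 4.0, -0.6, -1.0])  # intercept + three slopes (made up)
y = true_beta[0] + X_raw @ true_beta[1:] + rng.normal(0, 0.5, size=n)

# Equivalent of adding a constant: prepend a column of ones for the intercept.
X = np.column_stack([np.ones(n), X_raw])

# Normal equations: solve (X'X) beta = X'y for beta.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Same estimate via a least-squares solver, which is numerically safer
# when the predictors are strongly correlated.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print("normal equations:", np.round(beta_normal, 3))
print("lstsq:           ", np.round(beta_lstsq, 3))
```

Each slope is read as the expected change in the target for a one-unit change in that predictor with the others held fixed, so the sign of a coefficient gives the direction of the association and not proof of a causal effect.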


Output : -


🔹 Interpretation of Regression Analysis

- `RM (Rooms)` → **Positive Coefficient**: More rooms per dwelling are associated with higher house prices, holding the other predictors fixed.

- `LSTAT (Lower Status %)` → **Negative Coefficient**: A higher % of lower-status population is associated with lower house prices.

- `PTRATIO (Pupil-Teacher Ratio)` → **Negative Coefficient**: A higher pupil-teacher ratio is associated with lower house prices.

🔹 **Limitations:**

1. The model assumes **linear relationships**, but real estate pricing may have non-linear effects.

2. With only three predictors, it does not account for **location**, crime rates, or other neighborhood factors.

3. The dataset may contain **outliers** affecting predictions.
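The first limitation can be probed with a simple comparison: fit a straight line, then add a squared term and see how much the fit changes. The sketch below uses a synthetic curved relationship, not the Boston data; the helper `r_squared` and the logarithmic data-generating formula are assumptions made purely for illustration.

```python
import numpy as np

def r_squared(X, y):
    """In-sample R² of a least-squares fit of y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

# Synthetic curved relationship (illustration only, not the Boston data).
rng = np.random.default_rng(1)
x = rng.uniform(1, 30, size=250)
y = 40.0 - 10.0 * np.log(x) + rng.normal(0, 1.0, size=250)

ones = np.ones_like(x)
r2_linear = r_squared(np.column_stack([ones, x]), y)
r2_quadratic = r_squared(np.column_stack([ones, x, x**2]), y)

print(f"linear R²:    {r2_linear:.3f}")
print(f"quadratic R²: {r2_quadratic:.3f}")
```

In-sample R² can never decrease when a term is added, so a higher value alone does not prove non-linearity. A large jump, a visibly curved residual plot, or a better score on held-out data are the signals worth acting on.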
