Least squares method

The document outlines a Python script for analyzing the California Housing dataset using libraries such as pandas, numpy, and scikit-learn. It includes steps for exploratory data analysis, data visualization, and the implementation of a linear regression model to predict median house values. The script also evaluates the model's performance using mean squared error and R-squared metrics.
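As a rough illustration of the least squares idea behind a linear regression model, the sketch below estimates an intercept and coefficients on a small synthetic dataset using `np.linalg.lstsq`. The helper name `fit_least_squares`, the synthetic data, and the coefficient values are assumptions made up for this example; they are not part of the original script.

```python
import numpy as np


def fit_least_squares(X, y):
    """Return (intercept, coefficients) that minimize the sum of squared residuals."""
    # Prepend a column of ones so the intercept is estimated together with the slopes.
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta[0], beta[1:]


# Synthetic, noise-free data (hypothetical): y = 1.5 + 2.0 * x1 - 0.5 * x2
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 1.5 + 2.0 * X_demo[:, 0] - 0.5 * X_demo[:, 1]

intercept, coefs = fit_least_squares(X_demo, y_demo)
print("Intercept:", round(float(intercept), 6))
print("Coefficients:", np.round(coefs, 6))
```

Because the synthetic data contain no noise, the recovered values should match the coefficients used to generate them up to floating-point error.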

# Import necessary libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Load the California Housing dataset
housing_data = fetch_california_housing()
X = housing_data.data  # Features (NumPy array, one column per feature)
y = housing_data.target  # Target: median house value, in units of $100,000
feature_names = housing_data.feature_names

# Create a DataFrame from the data and feature names


df = pd.DataFrame(X, columns=feature_names)
df['Target'] = y

# Perform basic EDA (Exploratory Data Analysis)


# Display the first few rows
print("First few rows of the dataset:")
print(df.head())

# Display summary statistics


print("\nSummary statistics:")
print(df.describe())

# Check for missing values


print("\nMissing values:")
print(df.isnull().sum())

# Data types of each column


print("\nData types:")
print(df.dtypes)

# Histograms of features
df.hist(figsize=(12, 10), bins=20)
plt.suptitle('Histogram of Features')
plt.show()

# Scatter plot of a feature vs. target


feature = 'MedInc' # Choose 'MedInc' (Median Income) as an example feature
plt.figure(figsize=(8, 6))
plt.scatter(df[feature], df['Target'], alpha=0.5)
plt.title(f'Scatter Plot: {feature} vs. Target')
plt.xlabel(feature)
plt.ylabel('Target (Median House Value, in $100,000s)')
plt.grid(True)
plt.show()

# Split the data into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Linear Regression model


model = LinearRegression()

# Train the model on the training data


model.fit(X_train, y_train)

# Make predictions on the test data


y_pred = model.predict(X_test)

# Evaluate the model


mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Print the results


print(f"\nMean Squared Error: {mse}")
print(f"R-squared: {r2}")

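# Plot actual vs. predicted values against MedInc.
# The model uses all eight features, so its predictions do not fall on a single
# line when plotted against one feature; draw them as points, not a line plot.
medinc_idx = list(feature_names).index('MedInc')
plt.figure(figsize=(8, 6))
plt.scatter(X_test[:, medinc_idx], y_test, color='blue', alpha=0.5, label='Actual')
plt.scatter(X_test[:, medinc_idx], y_pred, color='red', alpha=0.5, label='Predicted')
plt.title('Actual vs. Predicted (Feature: MedInc)')
plt.xlabel('MedInc')
plt.ylabel('Target (Median House Value)')
plt.legend()
plt.grid(True)
plt.show()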
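The script's model is fitted on all features, so a single straight line against MedInc only comes out of a one-feature fit. The sketch below shows one possible way to obtain such a line and to compute mean squared error and R-squared directly from their definitions. It uses synthetic stand-in data rather than the California Housing dataset; the names `x_demo`, `y_demo`, `x_line`, `y_line` and the generating coefficients are hypothetical.

```python
import numpy as np

# Synthetic stand-in for one feature and the target (hypothetical values,
# not taken from the California Housing data).
rng = np.random.default_rng(42)
x_demo = rng.uniform(0.5, 10.0, size=200)
y_demo = 0.45 + 0.42 * x_demo + rng.normal(scale=0.3, size=200)

# Fit y = slope * x + intercept by ordinary least squares.
slope, intercept = np.polyfit(x_demo, y_demo, deg=1)

# Sort x before drawing so the fitted values trace one straight line.
order = np.argsort(x_demo)
x_line = x_demo[order]
y_line = slope * x_line + intercept

# Mean squared error and R-squared computed from their definitions.
y_fit = slope * x_demo + intercept
residuals = y_demo - y_fit
mse_demo = np.mean(residuals ** 2)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r2_demo = 1.0 - ss_res / ss_tot

print(f"Slope: {slope:.3f}, intercept: {intercept:.3f}")
print(f"MSE: {mse_demo:.3f}, R-squared: {r2_demo:.3f}")
```

The arrays `x_line` and `y_line` can be passed to `plt.plot` to draw the fitted line over a scatter of `x_demo` against `y_demo`.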

B.Tech / M.Tech (Integrated) Programmes-Regulations 2021-Volume-11-CSE-Higher Semester Syllabi-Control Copy
