4. Data Analytics I

The document outlines a laboratory exercise for a Data Science and Big Data Analytics course, focusing on linear regression using the California housing dataset. It details steps including loading the dataset, exploratory data analysis, data preprocessing, splitting the data, feature scaling, training the model, and evaluating its performance. The evaluation metrics include Mean Absolute Error, Mean Squared Error, Root Mean Squared Error, and R-Squared.

Uploaded by

Chirag Patekar

Third Year Engineering (2019 Pattern)

Course Code: 310256


Course Name: Data Science and Big Data Analytics Laboratory
Group A
4) Data Analytics I
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import fetch_california_housing

# Step 1: Load the dataset (using sklearn since Kaggle may require API authentication)
housing = fetch_california_housing()
df = pd.DataFrame(housing.data, columns=housing.feature_names)
df['PRICE'] = housing.target  # Target variable: median house value, in units of $100,000

# Step 2: Exploratory Data Analysis


print("\nDataset Information:")
df.info()  # info() prints directly and returns None, so no print() wrapper is needed
print("\nDataset Summary Statistics:")
print(df.describe())
# Step 3: Data Preprocessing
# Checking for missing values
print("\nMissing Values in Dataset:")
print(df.isnull().sum())

# Step 4: Splitting Data into Training and Testing Sets


X = df.drop(columns=['PRICE']) # Features
y = df['PRICE'] # Target variable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 5: Feature Scaling
# Fit the scaler on the training set only, then apply the same transform to the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Step 6: Train Linear Regression Model
model = LinearRegression()
model.fit(X_train_scaled, y_train)

# Step 7: Evaluate Model Performance
y_pred = model.predict(X_test_scaled)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print("\nModel Performance:")
print(f"Mean Absolute Error (MAE): {mae}")
print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")
print(f"R-Squared (R²): {r2}")
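
The four metrics reported by the evaluation step can also be computed directly from their definitions. The following is a minimal sketch, assuming small made-up arrays (y_true and y_hat are illustrative names and values, not taken from the housing data), with each result cross-checked against the corresponding sklearn.metrics function.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy values chosen for illustration only; they are not from the housing data.
y_true = np.array([3.0, 1.0, 2.0, 4.0])
y_hat = np.array([2.5, 1.0, 2.0, 5.0])

residuals = y_true - y_hat

mae = np.mean(np.abs(residuals))                 # mean absolute error
mse = np.mean(residuals ** 2)                    # mean squared error
rmse = np.sqrt(mse)                              # root of MSE, in the target's units
ss_res = np.sum(residuals ** 2)                  # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1.0 - ss_res / ss_tot                       # fraction of variance explained

# Cross-check the hand-written formulas against scikit-learn.
assert np.isclose(mae, mean_absolute_error(y_true, y_hat))
assert np.isclose(mse, mean_squared_error(y_true, y_hat))
assert np.isclose(r2, r2_score(y_true, y_hat))

print(f"MAE={mae}, MSE={mse}, RMSE={rmse:.4f}, R2={r2}")
```

RMSE is often the easiest of the three error metrics to interpret because it is expressed in the same units as the target, while R² is unitless.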

Explanation of Each Step:

1. Loading the Dataset


o Used sklearn.datasets.fetch_california_housing() to load the California Housing data.
o Converted it into a Pandas DataFrame.
2. Exploratory Data Analysis (EDA)
o Displayed dataset info and summary statistics using .info() and .describe().
3. Data Preprocessing
o Checked for missing values using .isnull().sum().
4. Splitting the Dataset
o Split the data into 80% training and 20% testing using train_test_split(), with random_state=42 so the split is reproducible.
5. Feature Scaling
o Standardized the features using StandardScaler(), fitting it on the training set only and applying the same transform to the test set.
6. Training the Linear Regression Model
o Fit a LinearRegression() model to the scaled training data.
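7. Model Evaluation
o Predicted on the scaled test set and computed Mean Absolute Error, Mean Squared Error, Root Mean Squared Error, and R-Squared.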
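
The feature scaling step fits the scaler on the training split and reuses those statistics for the test split. As a rough sketch of what that computes, the example below standardizes two made-up matrices by hand (X_train_toy and X_test_toy are hypothetical names and values, not part of the lab data) and checks the result against StandardScaler.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy feature matrices, for illustration only.
X_train_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test_toy = np.array([[4.0, 40.0]])

# Per-column statistics come from the training rows only.
mu = X_train_toy.mean(axis=0)
sigma = X_train_toy.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# z = (x - mean) / std, applied to both splits with the training statistics.
manual_train = (X_train_toy - mu) / sigma
manual_test = (X_test_toy - mu) / sigma

# StandardScaler fitted on the training data should give the same numbers.
scaler = StandardScaler().fit(X_train_toy)
assert np.allclose(manual_train, scaler.transform(X_train_toy))
assert np.allclose(manual_test, scaler.transform(X_test_toy))

print(manual_train.mean(axis=0))  # each training column is centered at zero
print(manual_test)                # test rows are scaled relative to the training data
```

Computing the mean and standard deviation from the training rows alone keeps information about the test set out of the preprocessing, so the test-set metrics stay an honest estimate of performance on unseen data.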
OUTPUT-
