Mlprogram 1

The document outlines a program to analyze the California Housing dataset by creating histograms and box plots for all numerical features to assess their distributions and identify outliers. It utilizes libraries such as pandas, seaborn, and matplotlib for data visualization and employs the IQR method to detect outliers. Additionally, it provides a summary of the dataset's statistics.

Uploaded by

Rana Manal

ML Program1

1. Develop a program to create histograms for all numerical features and analyse the distribution
of each feature. Generate box plots for all numerical features and identify any outliers. Use
the California Housing dataset.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing

# Step 1: Load the California Housing dataset as a pandas DataFrame
data = fetch_california_housing(as_frame=True)
housing_df = data.frame

# Step 2: Create histograms for numerical features
numerical_features = housing_df.select_dtypes(include=[np.number]).columns

# Plot histograms on a 3x3 grid (the frame has 9 numeric columns, including the target)
plt.figure(figsize=(15, 10))
for i, feature in enumerate(numerical_features):
    plt.subplot(3, 3, i + 1)
    sns.histplot(housing_df[feature], kde=True, bins=30, color='blue')
    plt.title(f'Distribution of {feature}')
plt.tight_layout()  # Called once, after all subplots are drawn
plt.show()

# Step 3: Generate box plots for numerical features
plt.figure(figsize=(15, 10))
for i, feature in enumerate(numerical_features):
    plt.subplot(3, 3, i + 1)
    sns.boxplot(x=housing_df[feature], color='orange')
    plt.title(f'Box Plot of {feature}')
plt.tight_layout()
plt.show()

# Step 4: Identify outliers using the IQR method
print("Outliers Detection:")
outliers_summary = {}
for feature in numerical_features:
    Q1 = housing_df[feature].quantile(0.25)
    Q3 = housing_df[feature].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    # Rows whose value falls outside [lower_bound, upper_bound]
    outliers = housing_df[(housing_df[feature] < lower_bound) |
                          (housing_df[feature] > upper_bound)]
    outliers_summary[feature] = len(outliers)
    print(f"{feature}: {len(outliers)} outliers")

# Optional: Print a summary of the dataset
print("\nDataset Summary:")
print(housing_df.describe())
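
The task also asks to analyse each distribution, which the program above does only visually. As a minimal sketch (not part of the original program), the histogram and box-plot readings could be backed by numbers: skewness per feature plus the same 1.5 * IQR outlier rule, computed for all columns at once. The helper name `summarize_distributions`, its parameter `k`, and the small `demo` frame are assumptions made for illustration; the synthetic data stands in for `housing_df` so the snippet runs without downloading the dataset.

```python
import numpy as np
import pandas as pd


def summarize_distributions(df: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Return skewness and IQR-based outlier statistics per numeric column.

    Hypothetical helper: `k` is the IQR multiplier (1.5 matches the program above).
    """
    numeric = df.select_dtypes(include=[np.number])
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Boolean mask: True where a value lies outside its column's [lower, upper] range
    is_outlier = numeric.lt(lower, axis=1) | numeric.gt(upper, axis=1)
    return pd.DataFrame({
        "skew": numeric.skew(),          # > 0: long right tail, < 0: long left tail
        "lower_bound": lower,
        "upper_bound": upper,
        "n_outliers": is_outlier.sum(),
        "pct_outliers": 100 * is_outlier.mean(),
    })


# Small synthetic frame (illustrative data, not the housing dataset)
demo = pd.DataFrame({
    "symmetric": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0],
    "right_skewed": [1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 3.0, 3.0, 50.0],
})
summary = summarize_distributions(demo)
print(summary.round(2))
```

With the real data, `summarize_distributions(housing_df)` should give one row per feature. A large positive `skew` alongside a high `pct_outliers` would suggest a long right tail rather than a handful of isolated bad values, which is worth checking against the corresponding histogram.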

output:
