
Exp 1 A

The document outlines a program to analyze the California Housing dataset by creating histograms and box plots for all numerical features to understand their distributions and identify outliers. It specifies the outputs to include statistical measures such as mean, standard deviation, and the number of outliers for each feature. The program utilizes Python libraries like pandas, matplotlib, and seaborn for data visualization and analysis.

Uploaded by

Shobha Hiremath

Term work 1: Develop a program to create histograms for all numerical features and analyze the distribution of each feature. Generate box plots for all numerical features and identify any outliers. Use the California Housing dataset.

Objective:

This program helps to understand the distribution of each numerical feature and to identify potential outliers in the dataset.

The California Housing dataset is used as the example.

Output to be observed:

Each attribute in the California Housing dataset is analysed for:

i. The mean of the values of the attribute
ii. The standard deviation of the attribute
iii. The 25th percentile (first quartile, Q1), the value below which 25% of the data lie
iv. The 50th percentile (median), the value below which 50% of the data lie
v. The outliers, i.e. values below Q1 - 1.5 x IQR or above Q3 + 1.5 x IQR, where Q3 is the 75th percentile and IQR = Q3 - Q1
vi. The histogram of the individual feature and its analysis
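
The outlier rule in item v can be tried on a small sample before running it on the full dataset. The following is a minimal sketch, not part of the original program: the function name `iqr_outliers`, the parameter `k`, and the sample values are made up for illustration. It uses the same 1.5 x IQR bounds as the program below.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries of `values` that fall outside the IQR bounds."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - k * iqr
    upper_bound = q3 + k * iqr
    return values[(values < lower_bound) | (values > upper_bound)]


# Made-up sample: nine ordinary values and one extreme value.
sample = pd.Series([2, 3, 3, 4, 4, 5, 5, 6, 6, 40])
outliers = iqr_outliers(sample)
print(outliers.tolist())  # [40]
```

For this sample Q1 = 3.25 and Q3 = 5.75, so the bounds are -0.5 and 9.5, and only the value 40 is reported.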

Python Instructions
# -*- coding: utf-8 -*-
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing

# Fetch the California Housing dataset
california_housing = fetch_california_housing(as_frame=True)
data = california_housing.frame

# Display the first few rows of the dataset
print(data.head())


# Create histograms for all numerical features
def create_histograms(data):
    data.hist(bins=30, figsize=(20, 15))
    plt.suptitle('Histograms of Numerical Features', fontsize=20)
    plt.show()


# Create box plots for all numerical features
def create_box_plots(data):
    plt.figure(figsize=(20, 15))
    for i, column in enumerate(data.columns):
        # The frame has 9 columns, so a 3 x 3 grid holds one plot per column
        plt.subplot(3, 3, i + 1)
        sns.boxplot(y=data[column])
        plt.title(f'Box Plot of {column}')
    plt.suptitle('Box Plots of Numerical Features', fontsize=20)
    plt.tight_layout(rect=[0, 0.03, 1, 0.95])
    plt.show()


# Analyze the distribution and identify outliers
def analyze_distribution(data):
    for column in data.columns:
        print(f'\nAnalyzing {column}:')
        print(data[column].describe())
        q1 = data[column].quantile(0.25)
        q3 = data[column].quantile(0.75)
        iqr = q3 - q1
        lower_bound = q1 - 1.5 * iqr
        upper_bound = q3 + 1.5 * iqr
        outliers = data[(data[column] < lower_bound) | (data[column] > upper_bound)]
        print(f'Number of outliers in {column}: {len(outliers)}')


# Execute the functions
create_histograms(data)
create_box_plots(data)
analyze_distribution(data)
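
The program prints one block of statistics per feature. If a single table is easier to compare, the same quantities can be collected into one DataFrame. This is a sketch under stated assumptions, not part of the original program: the helper name `summarize_features`, the column labels, and the small `demo` frame are hypothetical, and the synthetic frame stands in for the dataset so the snippet runs without downloading anything.

```python
import pandas as pd


def summarize_features(data: pd.DataFrame) -> pd.DataFrame:
    """Collect mean, std, quartiles and the IQR outlier count per column."""
    rows = {}
    for column in data.columns:
        q1 = data[column].quantile(0.25)
        q3 = data[column].quantile(0.75)
        iqr = q3 - q1
        # Same 1.5 x IQR rule as analyze_distribution
        is_outlier = (data[column] < q1 - 1.5 * iqr) | (data[column] > q3 + 1.5 * iqr)
        rows[column] = {
            "mean": data[column].mean(),
            "std": data[column].std(),
            "q1": q1,
            "median": data[column].median(),
            "q3": q3,
            "n_outliers": int(is_outlier.sum()),
        }
    return pd.DataFrame.from_dict(rows, orient="index")


# Synthetic stand-in for the housing frame (values are made up).
demo = pd.DataFrame({"a": [1, 2, 3, 4, 100], "b": [10, 11, 12, 13, 14]})
summary = summarize_features(demo)
print(summary)
```

With the housing data loaded as above, `summarize_features(data)` would give one row per feature.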

OUTPUT
MedInc HouseAge AveRooms ... Latitude Longitude MedHouseVal
0 8.3252 41.0 6.984127 ... 37.88 -122.23 4.526
1 8.3014 21.0 6.238137 ... 37.86 -122.22 3.585
2 7.2574 52.0 8.288136 ... 37.85 -122.24 3.521
3 5.6431 52.0 5.817352 ... 37.85 -122.25 3.413
4 3.8462 52.0 6.281853 ... 37.85 -122.25 3.422

[5 rows x 9 columns]
Analyzing MedInc:
count 20640.000000
mean 3.870671
std 1.899822
min 0.499900
25% 2.563400
50% 3.534800
75% 4.743250
max 15.000100
Name: MedInc, dtype: float64
Number of outliers in MedInc: 681

Analyzing HouseAge:
count 20640.000000
mean 28.639486
std 12.585558
min 1.000000
25% 18.000000
50% 29.000000
75% 37.000000
max 52.000000
Name: HouseAge, dtype: float64
Number of outliers in HouseAge: 0

Analyzing AveRooms:
count 20640.000000
mean 5.429000
std 2.474173
min 0.846154
25% 4.440716
50% 5.229129
75% 6.052381
max 141.909091
Name: AveRooms, dtype: float64
Number of outliers in AveRooms: 511

Analyzing AveBedrms:
count 20640.000000
mean 1.096675
std 0.473911
min 0.333333
25% 1.006079
50% 1.048780
75% 1.099526
max 34.066667
Name: AveBedrms, dtype: float64
Number of outliers in AveBedrms: 1424

Analyzing Population:
count 20640.000000
mean 1425.476744
std 1132.462122
min 3.000000
25% 787.000000
50% 1166.000000
75% 1725.000000
max 35682.000000
Name: Population, dtype: float64
Number of outliers in Population: 1196

Analyzing AveOccup:
count 20640.000000
mean 3.070655
std 10.386050
min 0.692308
25% 2.429741
50% 2.818116
75% 3.282261
max 1243.333333
Name: AveOccup, dtype: float64
Number of outliers in AveOccup: 711

Analyzing Latitude:
count 20640.000000
mean 35.631861
std 2.135952
min 32.540000
25% 33.930000
50% 34.260000
75% 37.710000
max 41.950000
Name: Latitude, dtype: float64
Number of outliers in Latitude: 0

Analyzing Longitude:
count 20640.000000
mean -119.569704
std 2.003532
min -124.350000
25% -121.800000
50% -118.490000
75% -118.010000
max -114.310000
Name: Longitude, dtype: float64
Number of outliers in Longitude: 0

Analyzing MedHouseVal:
count 20640.000000
mean 2.068558
std 1.153956
min 0.149990
25% 1.196000
50% 1.797000
75% 2.647250
max 5.000010
Name: MedHouseVal, dtype: float64
Number of outliers in MedHouseVal: 1071
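
As a worked check of the outlier rule, the bounds for MedInc can be recomputed by hand from the quartiles printed above. The snippet only repeats the arithmetic of the program on the printed values; it does not load the dataset.

```python
# Values copied from the MedInc summary in the output above.
q1 = 2.563400
q3 = 4.743250
minimum = 0.499900
maximum = 15.000100

iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

print(f"IQR         = {iqr:.3f}")          # 2.180
print(f"lower bound = {lower_bound:.3f}")  # -0.706
print(f"upper bound = {upper_bound:.3f}")  # 8.013
print(minimum < lower_bound)               # False
print(maximum > upper_bound)               # True
```

Because the minimum (0.4999) lies above the lower bound, none of the 681 MedInc outliers can be on the low side; all of them are values above roughly 8.013.
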
The graph sheets are to be attached, drawn to an appropriate scale.
i. Two representations are to be presented on each graph sheet.
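
One possible way to get two representations onto a single sheet is to draw the histogram and the box plot of a feature side by side in one figure. This is only a sketch: reading "two representations" as histogram plus box plot is an assumption, and the function name `plot_feature_sheet` and the demo values are hypothetical. The `Agg` backend is selected so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; no window is opened
import matplotlib.pyplot as plt
import pandas as pd


def plot_feature_sheet(values: pd.Series, bins: int = 30):
    """Draw a histogram and a box plot of one feature in a single figure."""
    fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(12, 5))
    ax_hist.hist(values.to_numpy(), bins=bins)
    ax_hist.set_title(f"Histogram of {values.name}")
    ax_hist.set_xlabel(values.name)
    ax_hist.set_ylabel("Count")
    ax_box.boxplot(values.to_numpy())
    ax_box.set_title(f"Box Plot of {values.name}")
    ax_box.set_ylabel(values.name)
    fig.tight_layout()
    return fig


# Made-up values standing in for one feature of the dataset.
demo = pd.Series([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 12.0], name="demo_feature")
fig = plot_feature_sheet(demo)
```

The returned figure can be written to a file with `fig.savefig(...)` for printing; with the housing data loaded, a call such as `plot_feature_sheet(data["MedInc"])` would produce the sheet for one feature.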
