Exp 1 A
and analyze the distribution of each feature. Generate box plots for all
numerical features and identify any outliers. Use the California Housing dataset.
Objective :
This program helps in understanding the distribution of each numerical feature and in identifying
potential outliers in the dataset.
Output to be observed :
Each attribute of the California Housing dataset is analysed for its distribution and for potential outliers.
Python Instructions
# -*- coding: utf-8 -*-
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing
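The listing above stops at the imports. A minimal sketch of the remaining steps is given here, assuming that outliers are counted with the common 1.5 x IQR rule; the manual does not state which rule was used. The helper names load_california_frame, count_iqr_outliers and analyze_features are hypothetical and are not part of the original program.

```python
import pandas as pd


def load_california_frame() -> pd.DataFrame:
    """Return the California Housing data as one DataFrame (features + target)."""
    # Imported here so the helpers below can be reused without scikit-learn.
    from sklearn.datasets import fetch_california_housing

    # as_frame=True makes .frame a DataFrame that also holds MedHouseVal.
    return fetch_california_housing(as_frame=True).frame


def count_iqr_outliers(values: pd.Series, k: float = 1.5) -> int:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed outlier rule)."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    outside = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return int(outside.sum())


def analyze_features(df: pd.DataFrame) -> dict:
    """Print summary statistics per numerical column and return outlier counts."""
    counts = {}
    for name in df.select_dtypes(include="number").columns:
        print(f"Analyzing {name}:")
        print(df[name].describe())
        counts[name] = count_iqr_outliers(df[name])
        print(f"Number of outliers in {name}: {counts[name]}")
    return counts
```

To use the sketch, call df = load_california_frame(), then print(df.head()) and analyze_features(df). The first call downloads the dataset if it is not already cached locally.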
OUTPUT
   MedInc  HouseAge  AveRooms  ...  Latitude  Longitude  MedHouseVal
0  8.3252      41.0  6.984127  ...     37.88    -122.23        4.526
1  8.3014      21.0  6.238137  ...     37.86    -122.22        3.585
2  7.2574      52.0  8.288136  ...     37.85    -122.24        3.521
3  5.6431      52.0  5.817352  ...     37.85    -122.25        3.413
4  3.8462      52.0  6.281853  ...     37.85    -122.25        3.422

[5 rows x 9 columns]
Analyzing MedInc:
count 20640.000000
mean 3.870671
std 1.899822
min 0.499900
25% 2.563400
50% 3.534800
75% 4.743250
max 15.000100
Name: MedInc, dtype: float64
Number of outliers in MedInc: 681
Analyzing HouseAge:
count 20640.000000
mean 28.639486
std 12.585558
min 1.000000
25% 18.000000
50% 29.000000
75% 37.000000
max 52.000000
Name: HouseAge, dtype: float64
Number of outliers in HouseAge: 0
Analyzing AveRooms:
count 20640.000000
mean 5.429000
std 2.474173
min 0.846154
25% 4.440716
50% 5.229129
75% 6.052381
max 141.909091
Name: AveRooms, dtype: float64
Number of outliers in AveRooms: 511
Analyzing AveBedrms:
count 20640.000000
mean 1.096675
std 0.473911
min 0.333333
25% 1.006079
50% 1.048780
75% 1.099526
max 34.066667
Name: AveBedrms, dtype: float64
Number of outliers in AveBedrms: 1424
Analyzing Population:
count 20640.000000
mean 1425.476744
std 1132.462122
min 3.000000
25% 787.000000
50% 1166.000000
75% 1725.000000
max 35682.000000
Name: Population, dtype: float64
Number of outliers in Population: 1196
Analyzing AveOccup:
count 20640.000000
mean 3.070655
std 10.386050
min 0.692308
25% 2.429741
50% 2.818116
75% 3.282261
max 1243.333333
Name: AveOccup, dtype: float64
Number of outliers in AveOccup: 711
Analyzing Latitude:
count 20640.000000
mean 35.631861
std 2.135952
min 32.540000
25% 33.930000
50% 34.260000
75% 37.710000
max 41.950000
Name: Latitude, dtype: float64
Number of outliers in Latitude: 0
Analyzing Longitude:
count 20640.000000
mean -119.569704
std 2.003532
min -124.350000
25% -121.800000
50% -118.490000
75% -118.010000
max -114.310000
Name: Longitude, dtype: float64
Number of outliers in Longitude: 0
Analyzing MedHouseVal:
count 20640.000000
mean 2.068558
std 1.153956
min 0.149990
25% 1.196000
50% 1.797000
75% 2.647250
max 5.000010
Name: MedHouseVal, dtype: float64
Number of outliers in MedHouseVal: 1071
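The outlier counts above can be cross-checked by hand from the printed quartiles. The sketch below assumes the 1.5 x IQR rule (an assumption, since the manual does not name the rule) and computes the lower and upper fences for four of the features, using only values copied from the output above. The function name iqr_fences is hypothetical.

```python
# Minimum, quartiles and maximum copied from the describe() output above.
summary = {
    "MedInc":    {"min": 0.4999,  "q1": 2.5634,  "q3": 4.74325, "max": 15.0001},
    "HouseAge":  {"min": 1.0,     "q1": 18.0,    "q3": 37.0,    "max": 52.0},
    "Latitude":  {"min": 32.54,   "q1": 33.93,   "q3": 37.71,   "max": 41.95},
    "Longitude": {"min": -124.35, "q1": -121.80, "q3": -118.01, "max": -114.31},
}


def iqr_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


for name, s in summary.items():
    lower, upper = iqr_fences(s["q1"], s["q3"])
    # A feature can only have outliers if its min or max lies beyond a fence.
    beyond = s["min"] < lower or s["max"] > upper
    print(f"{name}: fences = [{lower:.4f}, {upper:.4f}], values beyond fences: {beyond}")
```

For MedInc the upper fence works out to about 8.013, well below the maximum of 15.0001, so some values lie beyond it. For HouseAge, Latitude and Longitude the minimum and maximum both fall inside the fences, which is consistent with the zero outlier counts reported above.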
The graph sheets are to be attached, drawn to an appropriate scale.
i. Two representations are to be presented on each graph sheet
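One way to produce two representations per sheet is to draw the histogram and the box plot of a feature side by side in a single figure. The sketch below takes that reading of the instruction, which is an assumption. It uses plain matplotlib rather than seaborn, and the function name plot_feature_sheet and the default of 30 bins are hypothetical choices.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_feature_sheet(df: pd.DataFrame, feature: str, bins: int = 30):
    """Draw a histogram and a box plot of one feature in a single figure."""
    fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(10, 4))

    # Left panel: histogram showing the shape of the distribution.
    ax_hist.hist(df[feature].to_numpy(), bins=bins, edgecolor="black")
    ax_hist.set_title(f"Histogram of {feature}")
    ax_hist.set_xlabel(feature)
    ax_hist.set_ylabel("Frequency")

    # Right panel: box plot; matplotlib's default whiskers extend to 1.5 x IQR,
    # so points drawn beyond them are the candidate outliers.
    ax_box.boxplot(df[feature].to_numpy())
    ax_box.set_title(f"Box plot of {feature}")
    ax_box.set_ylabel(feature)
    ax_box.set_xticks([])

    fig.tight_layout()
    return fig
```

Given a DataFrame df holding the dataset, calling plot_feature_sheet(df, "MedInc").savefig("MedInc.png") writes one sheet to disk. Looping over df.columns gives one sheet per feature.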