Exp - 2-EDA - CaliforniaData Set - HeatMap - PairPlot-checkpoint - Jupyter Notebook
Develop a program to compute the correlation matrix to understand the relationships between pairs of features. Visualize the correlation matrix
using a heatmap to identify which variables have strong positive or negative correlations. Create a pair plot to visualize pairwise relationships between
features. Use the California Housing dataset.
Longitude Target
0 -122.23 4.526
1 -122.22 3.585
2 -122.24 3.521
3 -122.25 3.413
4 -122.25 3.422
In [5]: # Summary Statistics
print("\nSummary Statistics:")
print(df.describe()) # Summary statistics of dataset
Summary Statistics:
MedInc HouseAge AveRooms AveBedrms Population \
count 20640.000000 20640.000000 20640.000000 20640.000000 20640.000000
mean 3.870671 28.639486 5.429000 1.096675 1425.476744
std 1.899822 12.585558 2.474173 0.473911 1132.462122
min 0.499900 1.000000 0.846154 0.333333 3.000000
25% 2.563400 18.000000 4.440716 1.006079 787.000000
50% 3.534800 29.000000 5.229129 1.048780 1166.000000
75% 4.743250 37.000000 6.052381 1.099526 1725.000000
max 15.000100 52.000000 141.909091 34.066667 35682.000000
The summary statistics table provides key percentiles and other descriptive metrics for each numerical feature:
- **25% (First Quartile - Q1):** The value below which 25% of the data falls. It marks the lower end of the typical range of values.
- **50% (Median - Q2):** The middle value when the data is sorted. It measures central tendency and is less sensitive to extreme values than the mean.
- **75% (Third Quartile - Q3):** The value below which 75% of the data falls. It marks the upper end of the typical range of values.
- These percentiles are useful for detecting skewness, understanding the data distribution, and identifying potential outliers (values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR, where IQR = Q3 - Q1).
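The 1.5*IQR rule mentioned above can be sketched in a few lines of pandas. This is a minimal illustration and not part of the original notebook: the helper names `iqr_outlier_bounds` and `count_iqr_outliers` are hypothetical, and the `toy` DataFrame is made-up data used only so the example runs on its own.

```python
import pandas as pd


def iqr_outlier_bounds(series: pd.Series, k: float = 1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)  # first quartile
    q3 = series.quantile(0.75)  # third quartile
    iqr = q3 - q1               # interquartile range
    return q1 - k * iqr, q3 + k * iqr


def count_iqr_outliers(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count, for each numeric column, the values outside the IQR fences."""
    counts = {}
    for col in frame.select_dtypes(include="number").columns:
        lower, upper = iqr_outlier_bounds(frame[col], k)
        outside = (frame[col] < lower) | (frame[col] > upper)
        counts[col] = int(outside.sum())
    return pd.Series(counts, name="outlier_count")


# Toy data (illustrative only, NOT taken from the California Housing dataset)
toy = pd.DataFrame({
    "rooms": [4.0, 5.0, 5.5, 6.0, 6.5, 40.0],
    "age": [10, 20, 30, 40, 50, 52],
})
print(count_iqr_outliers(toy))
```

In the notebook, the same helper could be applied as `count_iqr_outliers(df)`, assuming `df` is the DataFrame summarized above. The table already hints at what to expect: AveRooms has a 75th percentile of about 6.05 but a maximum of about 141.91, so it should show values above the upper fence.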
In [6]: # Check for missing values
print("\nMissing Values in Each Column:")
print(df.isnull().sum()) # Count of missing values