The document outlines an experiment to compute and visualize the correlation matrix using the California Housing dataset, emphasizing the importance of understanding feature relationships in data analysis. It details the creation of a heatmap to represent correlations and a pair plot for visualizing pairwise relationships, highlighting the significance of these techniques for feature selection and multicollinearity detection. Key insights from the data exploration include the absence of missing values, the presence of skewed distributions, and the identification of potential outliers.


Experiment 2

Develop a program to compute the correlation matrix to understand the relationships between pairs of features. Visualize the correlation matrix using a heatmap to know which variables have strong positive/negative correlations. Create a pair plot to visualize pairwise relationships between features. Use the California Housing dataset.

Introduction
In data analysis and machine learning, understanding the relationships
between features is crucial for feature selection, multicollinearity detection,
and data interpretation. Correlation and pair plots are two essential techniques
to analyze these relationships.

1. Correlation Matrix
A correlation matrix is a table showing correlation coefficients between variables. It helps in understanding how strongly features are related to each other.

Types of Correlation
Positive Correlation (0 to +1): As one feature increases, the other also increases.
Negative Correlation (0 to -1): As one feature increases, the other decreases.
No Correlation (0): No linear relationship between the variables.
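The three cases can be illustrated with a short sketch using NumPy's `np.corrcoef`; the arrays here are made-up examples constructed to show each case, not columns from the dataset:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

pos = 2 * x + 1                              # rises with x
neg = -3 * x + 10                            # falls as x rises
flat = np.array([2.0, 3.0, 2.0, 3.0, 2.0])   # alternates with no linear trend in x

r_pos = np.corrcoef(x, pos)[0, 1]    # close to +1
r_neg = np.corrcoef(x, neg)[0, 1]    # close to -1
r_flat = np.corrcoef(x, flat)[0, 1]  # close to 0

print(round(r_pos, 2), round(r_neg, 2), round(r_flat, 2))
```

Any exact linear function of x yields |r| = 1; the alternating series has no linear trend in x, so its coefficient is zero even though it is clearly not constant.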

Why Should You Use a Correlation Matrix?

Identifies relationships between features.
Helps in detecting multicollinearity in machine learning models.
Highlights redundant features that may not add value to the model.

2. Heatmap for Correlation Matrix

A heatmap is a visual representation of the correlation matrix. It uses color coding to indicate the strength of relationships between variables.

 sns.heatmap() creates a heatmap visualization.
 corr_matrix is passed as the data to be visualized.
 annot=True displays the correlation values inside each cell.
 cmap='coolwarm' sets the color scheme:
o Red (warm) → Positive correlation.
o Blue (cool) → Negative correlation.
 fmt='.2f' ensures values are displayed with two decimal places.
Benefits of Using a Heatmap
Easy to interpret relationships between features.
Quickly identifies highly correlated variables.
Helps in feature selection and data preprocessing.

A correlation heatmap is used to visualize the relationship between numerical features in a dataset. It displays:

 Positive correlations (values close to +1, shown in red/warm shades with the 'coolwarm' colormap).
 Negative correlations (values close to -1, shown in blue/cool shades).
 Weak or no correlation (values around 0, shown in neutral colors like white or gray).
Each cell in the heatmap represents the correlation coefficient (r-value) between two features.
Correlation (r)   Meaning
+1.0              Perfect positive correlation (as X increases, Y increases)
+0.5              Moderate positive correlation
 0.0              No correlation (X and Y are independent)
-0.5              Moderate negative correlation
-1.0              Perfect negative correlation (as X increases, Y decreases)
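As a sanity check, the r-value in each cell is the Pearson coefficient, covariance divided by the product of standard deviations. A minimal sketch with made-up data, comparing the manual formula against pandas:

```python
import numpy as np
import pandas as pd

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

# Pearson r from the definition: sample covariance / (std_x * std_y)
r_manual = np.cov(x, y)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))

# The same value as pandas computes it for a correlation matrix cell
r_pandas = pd.DataFrame({"x": x, "y": y}).corr().loc["x", "y"]

print(round(r_manual, 4), round(r_pandas, 4))  # both ≈ 0.8
```

Note `ddof=1` so the standard deviations match the sample covariance returned by `np.cov`; the ddof factors cancel in the ratio, which is why pandas and the formula agree.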
3. Pair Plot
A pair plot (also known as a scatterplot matrix) is a collection of scatter plots for every pair of numerical variables in the dataset. It helps in visualizing relationships between variables.

Why Use a Pair Plot?

Shows the distribution of individual features along the diagonal.
Displays relationships between features using scatter plots.
Helps in identifying clusters, trends, and potential outliers.

In [13]: import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing

# Load California Housing dataset
data = fetch_california_housing()

# Convert to DataFrame
df = pd.DataFrame(data.data, columns=data.feature_names)
df['Target'] = data.target  # Adding the target variable (median house value)
In [11]: # Table of Meaning of Each Variable
variable_meaning = {
    "MedInc": "Median income in block group",
    "HouseAge": "Median house age in block group",
    "AveRooms": "Average number of rooms per household",
    "AveBedrms": "Average number of bedrooms per household",
    "Population": "Population of block group",
    "AveOccup": "Average number of household members",
    "Latitude": "Latitude of block group",
    "Longitude": "Longitude of block group",
    "Target": "Median house value (in $100,000s)"
}
variable_df = pd.DataFrame(list(variable_meaning.items()), columns=["Feature", "Description"])
print("\nVariable Meaning Table:")
print(variable_df)
Variable Meaning Table:
Feature Description
0 MedInc Median income in block group
1 HouseAge Median house age in block group
2 AveRooms Average number of rooms per household
3 AveBedrms Average number of bedrooms per household
4 Population Population of block group
5 AveOccup Average number of household members
6 Latitude Latitude of block group
7 Longitude Longitude of block group
8 Target Median house value (in $100,000s)
In [4]: # Basic Data Exploration
print("\nBasic Information about Dataset:")
print(df.info())  # Overview of dataset
print("\nFirst Five Rows of Dataset:")
print(df.head())  # Display first few rows
Basic Information about Dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype
 0   MedInc      20640 non-null  float64
 1   HouseAge    20640 non-null  float64
 2   AveRooms    20640 non-null  float64
 3   AveBedrms   20640 non-null  float64
 4   Population  20640 non-null  float64
 5   AveOccup    20640 non-null  float64
 6   Latitude    20640 non-null  float64
 7   Longitude   20640 non-null  float64
 8   Target      20640 non-null  float64
dtypes: float64(9)
memory usage: 1.4 MB
None

First Five Rows of Dataset:


MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude \
0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88
1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86
2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85
3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85
4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85

Longitude Target
0 -122.23 4.526
1 -122.22 3.585
2 -122.24 3.521
3 -122.25 3.413
4 -122.25 3.422

In [5]: # Summary Statistics
print("\nSummary Statistics:")
print(df.describe())  # Summary statistics of dataset

Summary Statistics:
             MedInc      HouseAge      AveRooms     AveBedrms    Population
count  20640.000000  20640.000000  20640.000000  20640.000000  20640.000000
mean       3.870671     28.639486      5.429000      1.096675   1425.476744
std        1.899822     12.585558      2.474173      0.473911   1132.462122
min        0.499900      1.000000      0.846154      0.333333      3.000000
25%        2.563400     18.000000      4.440716      1.006079    787.000000
50%        3.534800     29.000000      5.229129      1.048780   1166.000000
75%        4.743250     37.000000      6.052381      1.099526   1725.000000
max       15.000100     52.000000    141.909091     34.066667  35682.000000

           AveOccup      Latitude     Longitude        Target
count  20640.000000  20640.000000  20640.000000  20640.000000
mean       3.070655     35.631861   -119.569704      2.068558
std       10.386050      2.135952      2.003532      1.153956
min        0.692308     32.540000   -124.350000      0.149990
25%        2.429741     33.930000   -121.800000      1.196000
50%        2.818116     34.260000   -118.490000      1.797000
75%        3.282261     37.710000   -118.010000      2.647250
max     1243.333333     41.950000   -114.310000      5.000010

In [12]: # Explanation of Summary Statistics
summary_explanation = """
The summary statistics table provides key percentiles and other descriptive metrics for each numerical feature:
- **25% (First Quartile - Q1):** This represents the value below which 25% of the data falls. It helps in understanding the lower bound of typical data values.
- **50% (Median - Q2):** This is the middle value when the data is sorted. It provides the central tendency of the dataset.
- **75% (Third Quartile - Q3):** This represents the value below which 75% of the data falls. It helps in identifying the upper bound of typical values in the dataset.
- These percentiles are useful for detecting skewness, data distribution, and identifying potential outliers (values beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR).
"""
print("\nSummary Statistics Explanation:")
print(summary_explanation)

Summary Statistics Explanation:

The summary statistics table provides key percentiles and other descriptive metrics for each numerical feature:
- **25% (First Quartile - Q1):** This represents the value below which 25% of the data falls. It helps in understanding the lower bound of typical data values.
- **50% (Median - Q2):** This is the middle value when the data is sorted. It provides the central tendency of the dataset.
- **75% (Third Quartile - Q3):** This represents the value below which 75% of the data falls. It helps in identifying the upper bound of typical values in the dataset.
- These percentiles are useful for detecting skewness, data distribution, and identifying potential outliers (values beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR).

In [6]: # Check for missing values
print("\nMissing Values in Each Column:")
print(df.isnull().sum())  # Count of missing values

Missing Values in Each Column:
MedInc        0
HouseAge      0
AveRooms      0
AveBedrms     0
Population    0
AveOccup      0
Latitude      0
Longitude     0
Target        0
dtype: int64

In [7]: # Histograms for distribution of features
df.hist(figsize=(12, 8), bins=30, edgecolor='black')  # df.hist creates its own figure
plt.suptitle("Feature Distributions", fontsize=16)
plt.show()

In [8]: # Boxplots for outlier detection
plt.figure(figsize=(12, 6))
sns.boxplot(data=df)
plt.xticks(rotation=45)
plt.title("Boxplots of Features to Identify Outliers")
plt.show()
In [9]: # Correlation Matrix
plt.figure(figsize=(10, 6))
corr_matrix = df.corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title("Feature Correlation Heatmap")
plt.show()

In [10]: # Pairplot to analyze feature relationships (only a subset for clarity)
sns.pairplot(df[['MedInc', 'HouseAge', 'AveRooms', 'Target']], diag_kind='kde')
plt.show()

# Insights from Data Exploration
print("\nKey Insights:")
print("1. The dataset has", df.shape[0], "rows and", df.shape[1], "columns.")
print("2. No missing values were found in the dataset.")
print("3. Histograms show skewed distributions in some features like 'MedInc'.")
print("4. Boxplots indicate potential outliers in 'AveRooms' and 'AveOccup'.")
print("5. Correlation heatmap shows 'MedInc' has the highest correlation with house prices.")

Key Insights:
1. The dataset has 20640 rows and 9 columns.
2. No missing values were found in the dataset.
3. Histograms show skewed distributions in some features like 'MedInc'.
4. Boxplots indicate potential outliers in 'AveRooms' and 'AveOccup'.
5. Correlation heatmap shows 'MedInc' has the highest correlation with house prices.
