
2 Program

The document outlines an experiment to compute and visualize the correlation matrix using the California Housing dataset, emphasizing the importance of understanding feature relationships in data analysis. It details the creation of a heatmap to represent correlations and a pair plot for visualizing pairwise relationships, highlighting the significance of these techniques for feature selection and multicollinearity detection. Key insights from the data exploration include the absence of missing values, the presence of skewed distributions, and the identification of potential outliers.


Experiment 2

Develop a program to compute the correlation matrix to understand the
relationships between pairs of features. Visualize the correlation matrix
using a heatmap to see which variables have strong positive/negative
correlations. Create a pair plot to visualize pairwise relationships between
features. Use the California Housing dataset.

Introduction
In data analysis and machine learning, understanding the relationships
between features is crucial for feature selection, multicollinearity detection,
and data interpretation. Correlation and pair plots are two essential techniques
to analyze these relationships.

1. Correlation Matrix
A correlation matrix is a table showing correlation coefficients between variables. It
helps in understanding how strongly features are related to each other.

Types of Correlation
Positive Correlation (0 to +1): As one feature increases, the other also
increases.
Negative Correlation (-1 to 0): As one feature increases, the other decreases.
No Correlation (0): No linear relationship between the variables.
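
The coefficient that pandas reports by default in a correlation matrix is the Pearson coefficient. The sketch below shows one way to compute it by hand and compares it with NumPy's result. The helper name pearson_r and the sample values are illustrative assumptions, not part of the experiment.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson r: covariance divided by the product of the standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean of x
    dy = y - y.mean()  # deviations from the mean of y
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Made-up values for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_up = [2.0, 4.1, 5.9, 8.2, 9.8]    # rises as x rises
y_down = [9.9, 8.1, 6.0, 3.8, 2.1]  # falls as x rises

r_up = pearson_r(x, y_up)
r_down = pearson_r(x, y_down)
print(round(r_up, 3), round(r_down, 3))

# Cross-check against NumPy's built-in correlation
print(np.isclose(r_up, np.corrcoef(x, y_up)[0, 1]))
```

The first pair gives a coefficient close to +1 and the second a coefficient close to -1, matching the positive and negative cases described above.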

Why Should You Use a Correlation Matrix?

Identifies relationships between features.
Helps in detecting multicollinearity in machine learning models.
Highlights redundant features that may not add value to the model.
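
As a rough illustration of how a correlation matrix can flag redundant features, the sketch below lists feature pairs whose absolute correlation exceeds a cutoff. The synthetic columns, the helper name high_corr_pairs, and the 0.9 threshold are all assumptions chosen for the example. A suitable cutoff depends on the model and the data.

```python
import numpy as np
import pandas as pd

# Synthetic data (not the housing dataset): "a_copy" nearly duplicates "a"
rng = np.random.default_rng(0)
n = 500
a = rng.normal(size=n)
df_demo = pd.DataFrame({
    "a": a,
    "a_copy": a + rng.normal(scale=0.05, size=n),
    "b": rng.normal(size=n),  # generated independently of "a"
})

def high_corr_pairs(frame, threshold=0.9):
    """Return (col1, col2, |r|) for each pair with |r| >= threshold."""
    corr = frame.corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, skips duplicates
            value = float(corr.iloc[i, j])
            if value >= threshold:
                pairs.append((cols[i], cols[j], round(value, 3)))
    return pairs

pairs = high_corr_pairs(df_demo)
print(pairs)
```

Calling the same helper on the housing DataFrame would be one way to check whether any feature pairs are candidates for removal.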

2. Heatmap for Correlation Matrix

A heatmap is a visual representation of the correlation matrix. It uses color
coding to indicate the strength of relationships between variables.

- sns.heatmap() creates a heatmap visualization.
- corr_matrix is passed as the data to be visualized.
- annot=True displays the correlation values inside each cell.
- cmap='coolwarm' sets the color scheme:
  o Red (warm) → Positive correlation.
  o Blue (cool) → Negative correlation.
- fmt='.2f' ensures values are displayed with two decimal places.

Benefits of Using a Heatmap

Easy to interpret relationships between features.
Quickly identifies highly correlated variables.
Helps in feature selection and data preprocessing.
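
The direction of the 'coolwarm' colormap can be checked directly in matplotlib. The short sketch below samples the two ends of the colormap and reports which color dominates at each end. The Agg backend is selected only so the check runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; no window is opened
import matplotlib.pyplot as plt

cmap = plt.get_cmap("coolwarm")
low_r, low_g, low_b, _ = cmap(0.0)     # RGBA at the bottom of the scale
high_r, high_g, high_b, _ = cmap(1.0)  # RGBA at the top of the scale

print("low end :", "blue" if low_b > low_r else "red")    # low end : blue
print("high end:", "red" if high_r > high_b else "blue")  # high end: red
```

By default sns.heatmap() scales colors to the minimum and maximum of the data, so the neutral color does not necessarily sit at 0. Passing vmin=-1, vmax=1 and center=0 is one way to make the neutral color correspond to zero correlation.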

A correlation heatmap is used to visualize the relationship between numerical features in a dataset.
With cmap='coolwarm', it displays:

- Positive correlations (values close to +1) in red shades.
- Negative correlations (values close to -1) in blue shades.
- Weak or no correlation (values around 0) in neutral, light gray shades, provided the color scale is centered at 0.

Each cell in the heatmap represents the correlation coefficient (r-value) between two features.

Correlation (r)   Meaning
+1.0              Perfect positive correlation (as X increases, Y increases)
+0.5              Moderate positive correlation
 0.0              No linear correlation (X and Y may still be related nonlinearly)
-0.5              Moderate negative correlation
-1.0              Perfect negative correlation (as X increases, Y decreases)
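
A correlation of 0 rules out a linear relationship only, not dependence in general. The sketch below uses made-up values in which y is completely determined by x, yet the Pearson coefficient is essentially zero because the relationship is symmetric rather than linear.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 201)  # values symmetric around zero
y = x ** 2                       # y depends entirely on x, but not linearly

r = np.corrcoef(x, y)[0, 1]
print(f"r = {r:.6f}")
```

This is why a pair plot is a useful companion to the heatmap: a curved pattern like this one is visible in a scatter plot even though the coefficient hides it.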
3. Pair Plot
A pair plot (also known as a scatterplot matrix) is a collection of scatter plots
for every pair of numerical variables in the dataset. It helps in visualizing
relationships between variables.

Why Use a Pair Plot?

Shows the distribution of individual features along the diagonal.
Displays relationships between features using scatter plots.
Helps in identifying clusters, trends, and potential outliers.

In [13]: import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing

# Load California Housing dataset
data = fetch_california_housing()

# Convert to DataFrame
df = pd.DataFrame(data.data, columns=data.feature_names)
df['Target'] = data.target  # Add the target variable (median house value)

In [11]: # Table of Meaning of Each Variable
variable_meaning = {
    "MedInc": "Median income in block group",
    "HouseAge": "Median house age in block group",
    "AveRooms": "Average number of rooms per household",
    "AveBedrms": "Average number of bedrooms per household",
    "Population": "Population of block group",
    "AveOccup": "Average number of household members",
    "Latitude": "Latitude of block group",
    "Longitude": "Longitude of block group",
    "Target": "Median house value (in $100,000s)"
}
variable_df = pd.DataFrame(list(variable_meaning.items()),
                           columns=["Feature", "Description"])
print("\nVariable Meaning Table:")
print(variable_df)

Variable Meaning Table:
Feature Description
0 MedInc Median income in block group
1 HouseAge Median house age in block group
2 AveRooms Average number of rooms per household
3 AveBedrms Average number of bedrooms per household
4 Population Population of block group
5 AveOccup Average number of household members
6 Latitude Latitude of block group
7 Longitude Longitude of block group
8 Target Median house value (in $100,000s)

In [4]: # Basic Data Exploration
print("\nBasic Information about Dataset:")
print(df.info())  # Overview of dataset; info() prints itself and returns None
print("\nFirst Five Rows of Dataset:")
print(df.head())  # Display first few rows

Basic Information about Dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   MedInc      20640 non-null  float64
 1   HouseAge    20640 non-null  float64
 2   AveRooms    20640 non-null  float64
 3   AveBedrms   20640 non-null  float64
 4   Population  20640 non-null  float64
 5   AveOccup    20640 non-null  float64
 6   Latitude    20640 non-null  float64
 7   Longitude   20640 non-null  float64
 8   Target      20640 non-null  float64
dtypes: float64(9)
memory usage: 1.4 MB
None

First Five Rows of Dataset:

MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude \
0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88
1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86
2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85
3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85
4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85

Longitude Target
0 -122.23 4.526
1 -122.22 3.585
2 -122.24 3.521
3 -122.25 3.413
4 -122.25 3.422

In [5]: # Summary Statistics

print("\nSummary Statistics:")
print(df.describe()) # Summary statistics of dataset

Summary Statistics:
             MedInc      HouseAge      AveRooms     AveBedrms    Population  \
count  20640.000000  20640.000000  20640.000000  20640.000000  20640.000000
mean       3.870671     28.639486      5.429000      1.096675   1425.476744
std        1.899822     12.585558      2.474173      0.473911   1132.462122
min        0.499900      1.000000      0.846154      0.333333      3.000000
25%        2.563400     18.000000      4.440716      1.006079    787.000000
50%        3.534800     29.000000      5.229129      1.048780   1166.000000
75%        4.743250     37.000000      6.052381      1.099526   1725.000000
max       15.000100     52.000000    141.909091     34.066667  35682.000000

           AveOccup      Latitude     Longitude        Target
count  20640.000000  20640.000000  20640.000000  20640.000000
mean       3.070655     35.631861   -119.569704      2.068558
std       10.386050      2.135952      2.003532      1.153956
min        0.692308     32.540000   -124.350000      0.149990
25%        2.429741     33.930000   -121.800000      1.196000
50%        2.818116     34.260000   -118.490000      1.797000
75%        3.282261     37.710000   -118.010000      2.647250
max     1243.333333     41.950000   -114.310000      5.000010

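In [12]: # Explanation of Summary Statistics
summary_explanation = """
The summary statistics table provides key percentiles and other descriptive metrics for each numerical feature:
- **25% (First Quartile - Q1):** This represents the value below which 25% of the data falls. It helps in understanding the lower bound of typical data values.
- **50% (Median - Q2):** This is the middle value when the data is sorted. It provides the central tendency of the dataset.
- **75% (Third Quartile - Q3):** This represents the value below which 75% of the data falls. It helps in identifying the upper bound of typical values in the dataset.
- These percentiles are useful for detecting skewness, data distribution, and identifying potential outliers (values beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR).
"""
print("\nSummary Statistics Explanation:")
print(summary_explanation)
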
Summary Statistics Explanation:

The summary statistics table provides key percentiles and other descriptive
metrics for each numerical feature:
- **25% (First Quartile - Q1):** This represents the value below which 25% of
the data falls. It helps in understanding the lower bound of typical data
values.
- **50% (Median - Q2):** This is the middle value when the data is sorted. It
provides the central tendency of the dataset.
- **75% (Third Quartile - Q3):** This represents the value below which 75% of
the data falls. It helps in identifying the upper bound of typical values
in the dataset.
- These percentiles are useful for detecting skewness, data distribution, and
identifying potential outliers (values beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR).
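
The 1.5*IQR rule mentioned above can be sketched in a few lines of pandas. The helper name iqr_outliers and the sample values are illustrative assumptions. The multiplier 1.5 is the conventional choice but can be adjusted.

```python
import pandas as pd

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] together with the fences."""
    s = pd.Series(values, dtype=float)
    q1 = s.quantile(0.25)
    q3 = s.quantile(0.75)
    iqr = q3 - q1
    lower = float(q1 - k * iqr)
    upper = float(q3 + k * iqr)
    return s[(s < lower) | (s > upper)], (lower, upper)

# Made-up values: one observation sits far above the rest
values = [1, 2, 3, 4, 5, 100]
outliers, (lower, upper) = iqr_outliers(values)

print(lower, upper)       # -1.5 8.5
print(outliers.tolist())  # [100.0]
```

Applied to a column such as df['AveOccup'], the same helper would return the rows that a boxplot draws as individual points beyond the whiskers.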

In [6]: # Check for missing values

print("\nMissing Values in Each Column:")
print(df.isnull().sum()) # Count of missing values

Missing Values in Each Column:
MedInc        0
HouseAge      0
AveRooms      0
AveBedrms     0
Population    0
AveOccup      0
Latitude      0
Longitude     0
Target        0
dtype: int64

In [7]: # Histograms for distribution of features
# df.hist() creates its own figure, so a separate plt.figure() call is not
# needed (it would only leave behind an empty figure)
df.hist(figsize=(12, 8), bins=30, edgecolor='black')
plt.suptitle("Feature Distributions", fontsize=16)
plt.show()
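
Skewness that a histogram suggests visually can also be quantified. As a hedged sketch, pandas' DataFrame.skew() returns a sample skewness per column: values near 0 indicate a roughly symmetric distribution and clearly positive values indicate a long right tail. The synthetic columns below are assumptions for illustration, not the housing data.

```python
import numpy as np
import pandas as pd

# Synthetic columns: one symmetric, one with a long right tail
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "symmetric": rng.normal(size=2000),
    "right_skewed": rng.lognormal(mean=0.0, sigma=1.0, size=2000),
})

skew = demo.skew()  # sample skewness for each numeric column
print(skew.round(2))
```

Running df.skew() on the housing DataFrame would give a numeric counterpart to the shapes seen in the histograms.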

In [8]: # Boxplots for outlier detection

plt.figure(figsize=(12, 6))
sns.boxplot(data=df)
plt.xticks(rotation=45)
plt.title("Boxplots of Features to Identify Outliers")
plt.show()
In [9]: # Correlation Matrix
plt.figure(figsize=(10, 6))
corr_matrix = df.corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title("Feature Correlation Heatmap")
plt.show()

In [10]: # Pairplot to analyze feature relationships (only a subset for clarity)

sns.pairplot(df[['MedInc', 'HouseAge', 'AveRooms', 'Target']],
diag_kind='kde')
plt.show()

# Insights from Data Exploration
print("\nKey Insights:")
print("1. The dataset has", df.shape[0], "rows and", df.shape[1], "columns.")
print("2. No missing values were found in the dataset.")
print("3. Histograms show skewed distributions in some features like 'MedInc'.")
print("4. Boxplots indicate potential outliers in 'AveRooms' and 'AveOccup'.")
print("5. Correlation heatmap shows 'MedInc' has the highest correlation with house prices.")

Key Insights:
1. The dataset has 20640 rows and 9 columns.
2. No missing values were found in the dataset.
3. Histograms show skewed distributions in some features like 'MedInc'.
4. Boxplots indicate potential outliers in 'AveRooms' and 'AveOccup'.
5. Correlation heatmap shows 'MedInc' has the highest correlation with house prices.
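
An insight such as "one feature has the highest correlation with the target" can be read from the correlation matrix programmatically rather than by eye. The sketch below ranks features by the absolute value of their correlation with a target column. The synthetic data and column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data: "strong" drives the target, "weak" contributes a little,
# and "noise" is generated independently of the target
rng = np.random.default_rng(7)
n = 1000
strong = rng.normal(size=n)
weak = rng.normal(size=n)
noise = rng.normal(size=n)
demo = pd.DataFrame({
    "strong": strong,
    "weak": weak,
    "noise": noise,
    "Target": 2.0 * strong + 0.3 * weak + rng.normal(scale=1.0, size=n),
})

# Correlation of every feature with the target, excluding the target itself
target_corr = demo.corr()["Target"].drop("Target")

# Order by strength (absolute value) while keeping the sign in the result
order = target_corr.abs().sort_values(ascending=False).index
ranking = target_corr.loc[order]
print(ranking.round(3))
```

With the notebook's DataFrame, the equivalent starting point would be df.corr()["Target"].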
