2 Program
Introduction
In data analysis and machine learning, understanding the relationships
between features is crucial for feature selection, multicollinearity detection,
and data interpretation. Correlation and pair plots are two essential techniques
to analyze these relationships.
1.Correlation Matrix
A correlation matrix is a table showing correlation coefficients between variables. It
helps in understanding how strongly features are related to each other.
Types of Correlation
Positive Correlation (0 to +1): As one feature increases, the other also tends to increase.
Negative Correlation (-1 to 0): As one feature increases, the other tends to decrease.
No Correlation (0): No linear relationship between the variables.
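To make the sign of the coefficient concrete, here is a minimal sketch that computes the Pearson correlation coefficient from its definition. The helper `pearson_r` and the toy lists are hypothetical illustrations and are not part of the housing dataset.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations from the means (unnormalised covariance)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations (unnormalised variances)
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy data
hours = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # increases with hours
falling = [10, 8, 6, 4, 2]   # decreases as hours increase

r_pos = pearson_r(hours, rising)
r_neg = pearson_r(hours, falling)
print("rising:", r_pos)
print("falling:", r_neg)
```

Because both toy relationships are exactly linear, the coefficients sit at the two extremes of the range; real features usually land somewhere in between.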
A correlation heatmap visualizes the pairwise relationships between the numerical features in a dataset.
Each cell in the heatmap represents the correlation coefficient (r-value) between two features, with the
cell color encoding its sign and strength.
Correlation (r) Meaning
+1.0 Perfect positive correlation (as X increases, Y increases)
+0.5 Moderate positive correlation
0.0 No linear correlation (X and Y may still be related nonlinearly)
-0.5 Moderate negative correlation
-1.0 Perfect negative correlation (as X increases, Y decreases)
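A coefficient near zero only rules out a linear relationship. The sketch below, using hypothetical toy values and NumPy's `corrcoef`, shows a variable that is completely determined by another and still has a correlation of zero.

```python
import numpy as np

# Hypothetical toy data: y is fully determined by x, but not linearly
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = x ** 2

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is r
r = np.corrcoef(x, y)[0, 1]
print("r between x and x**2:", r)
```

The symmetric shape of `x ** 2` cancels the positive and negative contributions, which is why a scatter plot is a useful companion to the r-value.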
3.Pair Plot
A pair plot (also known as a scatterplot matrix) is a collection of scatter plots
for every pair of numerical variables in the dataset. It helps in visualizing
relationships between variables.
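As a rough illustration of what a pair plot contains, the sketch below builds one with `pandas.plotting.scatter_matrix` on a small synthetic frame. The plotting call used in the original notebook is not shown in this section, so this function, the column names `a`, `b`, `c`, and the random data are all stand-in assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display

import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Synthetic data (assumption): b depends on a, c is unrelated noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.5, size=200),
    "c": rng.normal(size=200),
})

# One scatter plot per pair of columns, with histograms on the diagonal
axes = scatter_matrix(toy, figsize=(6, 6), diagonal="hist")
print(axes.shape)
```

The result is a square grid of axes with one row and one column per variable; the `a` versus `b` panels should show a clear upward trend, while panels involving `c` should look like shapeless clouds.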
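# Assumption: the data is scikit-learn's California housing dataset,
# which matches the feature names and row count shown below
import pandas as pd
from sklearn.datasets import fetch_california_housing
data = fetch_california_housing()
# Convert to DataFrame
df = pd.DataFrame(data.data, columns=data.feature_names)
df['Target'] = data.target # Adding the target variable (median house value)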
In [11]: # Table of Meaning of Each Variable
variable_meaning = {
"MedInc": "Median income in block group",
"HouseAge": "Median house age in block group",
"AveRooms": "Average number of rooms per household",
"AveBedrms": "Average number of bedrooms per household",
"Population": "Population of block group",
"AveOccup": "Average number of household members",
"Latitude": "Latitude of block group",
"Longitude": "Longitude of block group",
"Target": "Median house value (in $100,000s)"
}
variable_df = pd.DataFrame(list(variable_meaning.items()),
columns=["Feature", "Description"])
print("\nVariable Meaning Table:")
print(variable_df)
Variable Meaning Table:
Feature Description
0 MedInc Median income in block group
1 HouseAge Median house age in block group
2 AveRooms Average number of rooms per household
3 AveBedrms Average number of bedrooms per household
4 Population Population of block group
5 AveOccup Average number of household members
6 Latitude Latitude of block group
7 Longitude Longitude of block group
8 Target Median house value (in $100,000s)
In [4]: # Basic Data Exploration
print("\nBasic Information about Dataset:")
df.info() # Overview of dataset (info() prints its report itself and returns None)
print("\nFirst Five Rows of Dataset:")
print(df.head()) # Display first few rows
Basic Information about Dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
# Column Non-Null Count Dtype
Longitude Target
0 -122.23 4.526
1 -122.22 3.585
2 -122.24 3.521
3 -122.25 3.413
4 -122.25 3.422
The summary statistics table provides key percentiles and other descriptive
metrics for each numerical feature:
- **25% (First Quartile - Q1):** The value below which 25% of the data falls.
It marks the lower end of the typical range of values.
- **50% (Median - Q2):** The middle value when the data is sorted. It
measures the central tendency of the feature.
- **75% (Third Quartile - Q3):** The value below which 75% of the data falls.
It marks the upper end of the typical range of values.
- These percentiles are useful for assessing skewness and the shape of the
distribution, and for identifying potential outliers (values below
Q1 - 1.5*IQR or above Q3 + 1.5*IQR, where IQR = Q3 - Q1).
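The 1.5*IQR rule can be sketched directly from the quartiles that `describe()` reports. The helper `iqr_outliers` and the sample values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import pandas as pd

def iqr_outliers(series):
    """Return the values outside the 1.5*IQR fences, plus the fences themselves."""
    stats = series.describe()          # includes the "25%" and "75%" rows
    q1, q3 = stats["25%"], stats["75%"]
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    mask = (series < lower) | (series > upper)
    return series[mask], (lower, upper)

# Hypothetical sample with one value far from the rest
sample = pd.Series([4, 5, 5, 6, 6, 7, 7, 8, 40])
outliers, fences = iqr_outliers(sample)
print("fences:", fences)
print("outliers:", outliers.tolist())
```

Here Q1 is 5 and Q3 is 7, so the IQR is 2 and the fences fall at 2 and 10; only the value 40 lies outside them.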
Key Insights:
1. The dataset has 20640 rows and 9 columns.
2. No missing values were found in the dataset.
3. Histograms show skewed distributions in some features like 'MedInc'.
4. Boxplots indicate potential outliers in 'AveRooms' and 'AveOccup'.
5. Correlation heatmap shows 'MedInc' has the highest correlation with house prices.
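Checks like insights 2 and 5 can be reproduced with a few pandas calls. The sketch below runs them on a synthetic frame, so the column names `income`, `noise_feature`, and `price` are assumptions standing in for the real features and target.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumption): price depends strongly on income only
rng = np.random.default_rng(42)
n = 500
income = rng.normal(size=n)
toy = pd.DataFrame({
    "income": income,
    "noise_feature": rng.normal(size=n),
    "price": 3.0 * income + rng.normal(scale=0.5, size=n),
})

# Missing values per column (all zeros means no missing data)
missing_per_column = toy.isnull().sum()

# Absolute correlation of each feature with the target, strongest first
target_corr = toy.corr()["price"].drop("price").abs().sort_values(ascending=False)

print(missing_per_column)
print(target_corr)
```

On the real DataFrame the same pattern would be applied with `'Target'` in place of `price`; sorting by absolute value keeps strong negative correlations from being overlooked.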