ML Lab Manual PRGM 2&3

The document outlines two programs: the first computes the Pearson correlation coefficient and covariance matrix for the Iris dataset, visualizing the results with scatter plots and heatmaps. The second implements Principal Component Analysis (PCA) to reduce the Iris dataset's dimensionality from four features to two while preserving variance for better visualization. Key outputs include correlation coefficients, covariance matrices, and explained variance ratios from PCA.

PROGRAM 2

2. Develop a program to load a dataset with at least two numerical columns (e.g., Iris, Titanic). Plot a
scatter plot of two variables and calculate their Pearson correlation coefficient. Write a program to
compute the covariance and correlation matrix for a dataset. Visualize the correlation matrix using a
heatmap to identify which variables have strong positive/negative correlations.
Solution:
• Pearson correlation coefficient

The Pearson correlation coefficient (r) measures the linear relationship between two variables, telling us
how strongly and in what direction they are related. Its value ranges from −1 (perfect negative linear
relationship) through 0 (no linear relationship) to +1 (perfect positive linear relationship).
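
As a quick illustration (not part of the required program), r can be computed directly from its definition, r = cov(x, y) / (σx σy). The sketch below assumes NumPy and seaborn are available and uses the same two Iris columns as the solution that follows.

import numpy as np
import seaborn as sns

# Illustrative sketch: Pearson r computed from its definition
iris = sns.load_dataset('iris')
x = iris['sepal_length'].to_numpy()
y = iris['petal_length'].to_numpy()

# Sample covariance and sample standard deviations (ddof=1)
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)
r = cov_xy / (x.std(ddof=1) * y.std(ddof=1))

print("Pearson r (manual):", r)                      # ~0.8718, matching the program output below
print("Pearson r (NumPy) :", np.corrcoef(x, y)[0, 1])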

• Covariance

Covariance measures the direction of the relationship between two variables. It determines whether two
variables move together (positive covariance) or in opposite directions (negative covariance).
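
A small, self-contained illustration (toy data, not from the lab program) of how the sign of the covariance reflects the direction of the relationship:

import numpy as np

# Toy data: b moves with a, c moves opposite to a
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = 2 * a + 1
c = -3 * a + 10

print("cov(a, b):", np.cov(a, b)[0, 1])   # positive (5.0): a and b move together
print("cov(a, c):", np.cov(a, c)[0, 1])   # negative (-7.5): a and c move in opposite directions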

• Correlation Matrix

A correlation matrix is a table showing the Pearson correlation coefficients between multiple variables
in a dataset. It helps in understanding the relationships among the different numerical variables.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the Iris dataset
data = sns.load_dataset('iris')

# Select two numerical columns for the scatter plot and correlation calculations
x_col = 'sepal_length'
y_col = 'petal_length'

# Compute the Pearson correlation coefficient
correlation = data[[x_col, y_col]].corr(method='pearson')
print("Pearson Correlation Coefficient:\n", correlation)

# Compute the covariance matrix
covariance = data[[x_col, y_col]].cov()
print("Covariance Matrix:\n", covariance)

# Create a scatter plot
plt.figure(figsize=(8, 5))
plt.scatter(data[x_col], data[y_col])
plt.xlabel(x_col)
plt.ylabel(y_col)
plt.title(f"Scatter Plot of {x_col} vs {y_col}")
plt.show()

# Compute the covariance and correlation matrices for all numerical columns
data_co = data.iloc[:, :-1]  # Exclude the categorical 'species' column
covariance_matrix = data_co.cov()
correlation_matrix = data_co.corr()

print("Covariance Matrix:\n", covariance_matrix)
print("\nCorrelation Matrix:\n", correlation_matrix)

# Visualize the correlation matrix with a heatmap
plt.figure(figsize=(8, 5))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title("Correlation Matrix")
plt.show()
Output:

Pearson Correlation Coefficient:
sepal_length petal_length
sepal_length 1.000000 0.871754
petal_length 0.871754 1.000000
Covariance Matrix:
sepal_length petal_length
sepal_length 0.685694 1.274315
petal_length 1.274315 3.116278

Covariance Matrix:
sepal_length sepal_width petal_length petal_width
sepal_length 0.685694 -0.042434 1.274315 0.516271
sepal_width -0.042434 0.189979 -0.329656 -0.121639
petal_length 1.274315 -0.329656 3.116278 1.295609
petal_width 0.516271 -0.121639 1.295609 0.581006

Correlation Matrix:
sepal_length sepal_width petal_length petal_width
sepal_length 1.000000 -0.117570 0.871754 0.817941
sepal_width -0.117570 1.000000 -0.428440 -0.366126
petal_length 0.871754 -0.428440 1.000000 0.962865
petal_width 0.817941 -0.366126 0.962865 1.000000

The heatmap of this matrix shows that petal_length and petal_width have the strongest positive correlation (0.96), and both correlate strongly with sepal_length, while sepal_width is weakly negatively correlated with the other three features.
PROGRAM 3
3. Develop a program to implement Principal Component Analysis (PCA) for reducing the dimensionality
of the Iris dataset from 4 features to 2.
Solution:
• Principal Component Analysis (Algorithm)
Principal Component Analysis (PCA) is a dimensionality reduction technique used to transform a high-
dimensional dataset into a lower-dimensional space while preserving the most important information
(variance). It helps in simplifying data visualization and improving computational efficiency in machine
learning models.
Steps in the PCA Algorithm (a from-scratch NumPy sketch of these steps appears after Step 5):
Step 1: Standardization of Data
Since PCA is affected by scale, we normalize or standardize the dataset by subtracting the mean and
dividing by the standard deviation of each feature:
z = (x − μ) / σ
Step 2: Compute the Covariance Matrix
To understand how variables relate to each other, we compute the covariance matrix of the standardized data:
Cov(X) = (Xᵀ X) / (n − 1), where n is the number of samples.

Step 3: Compute Eigenvalues and Eigenvectors
Eigenvalues (λ): measure the amount of variance explained by each principal component.
Eigenvectors (v): indicate the direction of the new feature axes (principal components).
These satisfy Cov(X) v = λ v.
Step 4: Select the Top Principal Components
Sort the eigenvalues in descending order.
Select the top k eigenvectors corresponding to the highest eigenvalues.
The number of principal components (k) is chosen based on the explained variance ratio.

Step 5: Transform the Data
Project the original data onto the new lower-dimensional space using:
Z = XW
where:
W is the matrix of selected eigenvectors (principal components), and
Z is the transformed dataset with reduced dimensions.
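
Before the library-based solution, here is a minimal from-scratch sketch of Steps 1–5 with NumPy. This is an illustration, not the required solution; it assumes scikit-learn is available only to load the Iris data.

import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data                                   # 150 x 4 feature matrix

# Step 1: standardize each feature (zero mean, unit variance)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix of the standardized data (4 x 4)
cov = np.cov(X_std, rowvar=False)

# Step 3: eigenvalues and eigenvectors (eigh, since the covariance matrix is symmetric)
eigvals, eigvecs = np.linalg.eigh(cov)

# Step 4: sort by eigenvalue (descending) and keep the top k = 2 eigenvectors
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
W = eigvecs[:, :2]

# Step 5: project the data onto the principal components, Z = XW (150 x 2)
Z = X_std @ W

print("Explained variance ratio:", eigvals[:2] / eigvals.sum())

Up to a possible sign flip of each component, Z should match the X_pca produced by scikit-learn below, and the printed ratio should agree with the program's output.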
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load the Iris dataset
iris = load_iris()
X = iris.data    # Features (4-dimensional)
y = iris.target  # Target labels

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA to reduce dimensions from 4 to 2
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Plot the PCA result
plt.figure(figsize=(8, 6))
colors = ['r', 'g', 'b']
labels = iris.target_names
for i, color, label in zip(range(3), colors, labels):
    plt.scatter(X_pca[y == i, 0], X_pca[y == i, 1], color=color, label=label)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA of Iris Dataset')
plt.legend()
plt.grid()
plt.show()

# Explained variance ratio of the two retained components
print("Explained variance ratio:", pca.explained_variance_ratio_)
Output:

Explained variance ratio: [0.72962445 0.22850762]

Together, the first two principal components retain about 95.8% of the total variance in the original four-dimensional data.
