
Habib Education and Welfare Society’s

M.S. COLLEGE OF SCIENCE, ARTS, COMMERCE, BSC (IT), BSC (CS), B.COM, BMS (DEVGHAR)

MUMBAI UNIVERSITY

DATA SCIENCE

LAB MANUAL

(A.Y. 2024 – 2025)



CERTIFICATE
DEPARTMENT OF COMPUTER SCIENCE

This is to certify that Mr. / Miss. ____________________ of B.Sc. (CS) Semester VI, Roll No. ________ has successfully completed the practicals in the subject of Data Science as per the requirement of the University of Mumbai, in part fulfillment for the completion of the Degree of Bachelor of Science (Computer Science). It is also to certify that this is the original work of the candidate, done during the academic year 2024-2025.

Internal Examiner Subject Teacher

H.O.D
DEPARTMENT OF C.S.

DATE OF SUBMISSION: COLLEGE SEAL


INDEX

Sr. No.  Practical Name  Date  Sign

1. Introduction to Excel
   a. Perform conditional formatting on a dataset using various criteria.
   b. Create a pivot table to analyze and summarize data.
   c. Use the VLOOKUP function to retrieve information from a different worksheet or table.
   d. Perform what-if analysis using Goal Seek to determine input values for a desired output.
2. Data Frames and Basic Data Pre-processing
   a. Read data from CSV and JSON files into a data frame.
   b. Perform basic data pre-processing tasks such as handling missing values and outliers.
   c. Manipulate and transform data using functions like filtering, sorting, and grouping.
3. Feature Scaling and Dummification
   a. Apply feature-scaling techniques like standardization and normalization to numerical features.
   b. Perform feature dummification to convert categorical variables into numerical representations.
4. Hypothesis Testing
   a. Conduct a hypothesis test using appropriate statistical tests (e.g., t-test, chi-square test).
      i) t-test
      ii) chi-square test
5. ANOVA (Analysis of Variance)
   a. Perform one-way ANOVA to compare means across multiple groups.
   b. Conduct post-hoc tests to identify significant differences between group means.
6. Regression and Its Types
   a. Implement simple linear regression using a dataset.
   b. Extend the analysis to multiple linear regression and assess the impact of additional predictors.
7. Logistic Regression and Decision Tree
8. K-Means Clustering
9. Principal Component Analysis (PCA)
10. Data Visualization and Storytelling


PRACTICAL 1

Aim: Introduction to Excel:

A. Perform conditional formatting on a dataset using various criteria.

Steps:
Step 1: Go to Conditional Formatting > Highlight Cells Rules > Greater Than.


Step 2: Enter the Greater Than threshold value, for example 2000.

Step 3: Go to Conditional Formatting > Data Bars > Solid Fill.


B. Create a pivot table to analyze and summarize data.

Steps:
Step 1: Select the entire table and go to the Insert tab > PivotChart > PivotChart.
Step 2: Select "New Worksheet" in the Create PivotChart window.

Step 3: Select and drag attributes into the field boxes below the chart.


C. Use the VLOOKUP function to retrieve information from a different worksheet or table.

Steps:

Step 1: Click on an empty cell and type the following formula:

=VLOOKUP(B3, B3:D3, 1, TRUE)
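
For reference, a more typical use of VLOOKUP pulls a value from a separate lookup table. The formula below is only an illustrative sketch, assuming a lookup table in Sheet2!A2:C100, the value to look up in cell A2, and the value to return in the table's third column:

=VLOOKUP(A2, Sheet2!A2:C100, 3, FALSE)

The last argument FALSE requests an exact match; TRUE (as in the formula above) allows an approximate match against a sorted first column.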

D. Perform what-if analysis using Goal Seek to determine input values for a desired output.

Steps:

Step 1: In the Data tab, go to What-If Analysis > Goal Seek.


Step 2: Fill in the Goal Seek dialog (Set cell, To value, By changing cell) accordingly and click OK.


PRACTICAL 2

Aim: Data Frames and Basic Data Pre-processing


A. Read data from CSV and JSON files into a data frame.

(1)
# Read data from a CSV file
import pandas as pd

df = pd.read_csv('Student_Marks.csv')
print("Our dataset:")
print(df)

(2)
# Read data from a JSON file
import pandas as pd

data = pd.read_json('dataset.json')
print(data)


B. Perform basic data pre-processing tasks such as handling missing values and outliers.

Code:

(1)
# Replacing NA values using fillna()
import pandas as pd

df = pd.read_csv('titanic.csv')
print(df)
print(df.head(10))
print("Dataset after filling NA values with 0:")
df2 = df.fillna(value=0)
print(df2)


(2)
# Dropping NA values using dropna()
import pandas as pd

df = pd.read_csv('titanic.csv')
print(df)
print(df.head(10))

print("Dataset after dropping NA values:")
df.dropna(inplace=True)
print(df)
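
The aim of this part also mentions outliers. A minimal sketch for removing outliers with the IQR (interquartile range) rule is shown below; it assumes the 'Age' column of titanic.csv, but any numeric column works the same way:

(3)
# Handling outliers using the IQR rule (illustrative sketch)
import pandas as pd

df = pd.read_csv('titanic.csv')
q1 = df['Age'].quantile(0.25)              # first quartile
q3 = df['Age'].quantile(0.75)              # third quartile
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

print("Rows before removing outliers:", len(df))
# Keep rows whose Age lies inside the IQR fences (rows with missing Age are kept as-is)
df_no_outliers = df[df['Age'].between(lower, upper) | df['Age'].isna()]
print("Rows after removing 'Age' outliers:", len(df_no_outliers))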

C. Manipulate and transform data using functions like filtering, sorting, and grouping.

Code:
import pandas as pd

# Load iris dataset


iris = pd.read_csv('Iris.csv')
# Filtering data based on a condition
setosa = iris[iris['Species'] == 'setosa']
print("Setosa samples:")


print(setosa.head())

# Sorting data
sorted_iris = iris.sort_values(by='SepalLengthCm', ascending=False)
print("\nSorted iris dataset:")
print(sorted_iris.head())

# Grouping data
grouped_species = iris.groupby('Species').mean()
print("\nMean measurements for each species:")
print(grouped_species)


PRACTICAL 3

Aim: Feature Scaling and Dummification


A. Apply feature-scaling techniques like standardization and
normalization to numerical features.
Code:

# Standardization and normalization
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.read_csv('wine.csv', header=None, usecols=[0, 1, 2], skiprows=1)
df.columns = ['classlabel', 'Alcohol', 'Malic Acid']
print("Original DataFrame:")
print(df)

scaling = MinMaxScaler()
scaled_value = scaling.fit_transform(df[['Alcohol', 'Malic Acid']])
df[['Alcohol', 'Malic Acid']] = scaled_value
print("\nDataframe after MinMax Scaling")
print(df)

scaling = StandardScaler()
scaled_standardvalue = scaling.fit_transform(df[['Alcohol', 'Malic Acid']])
df[['Alcohol', 'Malic Acid']] = scaled_standardvalue
print("\nDataframe after Standard Scaling")
print(df)


B. Perform feature dummification to convert categorical variables into numerical representations.

Code:

import pandas as pd

iris = pd.read_csv("Iris.csv")
print(iris)

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
iris['code'] = le.fit_transform(iris.Species)
print(iris)
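
LabelEncoder assigns an arbitrary integer code to each species. For dummification in the one-hot sense, a minimal sketch using pandas get_dummies on the same Iris data could look like this:

import pandas as pd

iris = pd.read_csv("Iris.csv")
# One-hot encode the Species column: one 0/1 indicator column per category
iris_dummies = pd.get_dummies(iris, columns=['Species'], prefix='Species')
print(iris_dummies.head())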



PRACTICAL 4

Aim: Hypothesis Testing


A. Conduct a hypothesis test using appropriate statistical tests
(e.g., t-test, chi square test).

Code:

# t-test
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Generate two samples for demonstration purposes
np.random.seed(42)
sample1 = np.random.normal(loc=10, scale=2, size=30)
sample2 = np.random.normal(loc=12, scale=2, size=30)

# Perform a two-sample t-test
t_statistic, p_value = stats.ttest_ind(sample1, sample2)

# Set the significance level
alpha = 0.05

print("Results of Two-Sample t-test:")
print(f'T-statistic: {t_statistic}')
print(f'P-value: {p_value}')


print(f"Degrees of Freedom: {len(sample1) + len(sample2) - 2}")

# Plot the distributions
plt.figure(figsize=(10, 6))
plt.hist(sample1, alpha=0.5, label='Sample 1', color='blue')
plt.hist(sample2, alpha=0.5, label='Sample 2', color='orange')
plt.axvline(np.mean(sample1), color='blue', linestyle='dashed', linewidth=2)
plt.axvline(np.mean(sample2), color='orange', linestyle='dashed', linewidth=2)
plt.title('Distributions of Sample 1 and Sample 2')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.legend()

# Highlight the critical region if the null hypothesis is rejected
if p_value < alpha:
    critical_region = np.linspace(min(sample1.min(), sample2.min()),
                                  max(sample1.max(), sample2.max()), 1000)
    plt.fill_between(critical_region, 0, 5, color='red', alpha=0.3, label='Critical Region')
    plt.text(11, 5, f'T-statistic: {t_statistic:.2f}', ha='center', va='center',
             color='black', backgroundcolor='white')

# Show the plot
plt.show()

# Draw conclusions
if p_value < alpha:
    if np.mean(sample1) > np.mean(sample2):
        print("Conclusion: There is significant evidence to reject the null hypothesis.")
        print("Interpretation: The mean of Sample 1 is significantly higher than that of Sample 2.")
    else:
        print("Conclusion: There is significant evidence to reject the null hypothesis.")
        print("Interpretation: The mean of Sample 2 is significantly higher than that of Sample 1.")
else:
    print("Conclusion: Fail to reject the null hypothesis.")
    print("Interpretation: There is not enough evidence to claim a significant difference between the means.")

Output:


# chi-square test
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
import warnings
from scipy import stats

warnings.filterwarnings('ignore')

df = sb.load_dataset('mpg')
print(df)
print(df['horsepower'].describe())
print(df['model_year'].describe())

bins = [0, 75, 150, 240]
df['horsepower_new'] = pd.cut(df['horsepower'], bins=bins, labels=['l', 'm', 'h'])
c = df['horsepower_new']
print(c)

ybins = [69, 72, 74, 84]
label = ['t1', 't2', 't3']
df['modelyear_new'] = pd.cut(df['model_year'], bins=ybins, labels=label)
newyear = df['modelyear_new']
print(newyear)

df_chi = pd.crosstab(df['horsepower_new'], df['modelyear_new'])
print(df_chi)
print(stats.chi2_contingency(df_chi))

Output:


Conclusion:
There is sufficient evidence to reject the null hypothesis, indicating that
there is a significant association between 'horsepower_new' and
'modelyear_new' categories.
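
The chi2_contingency call above prints the whole result at once. A minimal sketch, reusing df_chi from the code above, that unpacks the individual values and applies a 0.05 significance level:

chi2_stat, p_value, dof, expected = stats.chi2_contingency(df_chi)
print("Chi-square statistic:", chi2_stat)
print("p-value:", p_value)
print("Degrees of freedom:", dof)
if p_value < 0.05:
    print("Reject the null hypothesis: the two categorical variables are associated.")
else:
    print("Fail to reject the null hypothesis: no significant association found.")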


PRACTICAL 5
Aim: - ANOVA (Analysis of Variance)
A. Perform one-way ANOVA to compare means across multiple groups.
B. Conduct post-hoc tests to identify significant differences between group means.

import pandas as pd
import scipy.stats as stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

group1 = [23, 25, 29, 34, 30]
group2 = [19, 20, 22, 24, 25]
group3 = [15, 18, 20, 21, 17]
group4 = [28, 24, 26, 30, 29]

# Combine data into a DataFrame
data = pd.DataFrame({'value': group1 + group2 + group3 + group4,
                     'group': ['Group1'] * len(group1) + ['Group2'] * len(group2) +
                              ['Group3'] * len(group3) + ['Group4'] * len(group4)})

# Perform one-way ANOVA
f_statistics, p_value = stats.f_oneway(group1, group2, group3, group4)
print("one-way ANOVA:")
print("F-statistics:", f_statistics)
print("p-value", p_value)

# Perform Tukey-Kramer post-hoc test
tukey_results = pairwise_tukeyhsd(data['value'], data['group'])
print("\nTukey-Kramer post-hoc test:")
print(tukey_results)


Output:

Conclusion

• F-statistic: This value indicates the ratio of the variance between groups to the
variance within groups. A larger F-statistic suggests that the means of the
groups are more different from each other compared to the variability within
each group.
• If the p-value is less than the chosen significance level (e.g., 0.05), it suggests
that there are significant differences among the group means.
• If the p-value is greater than the significance level, it suggests that there is
insufficient evidence to reject the null hypothesis, meaning there are no
significant differences among the group means.
• There are significant differences among the means of the groups.
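
A minimal sketch that applies the 0.05 decision rule from the notes above to the values computed in this practical (it assumes the p_value and tukey_results objects from the code):

alpha = 0.05
if p_value < alpha:
    print("One-way ANOVA: at least one group mean differs significantly.")
else:
    print("One-way ANOVA: no significant difference among the group means.")

# tukey_results.reject is a boolean array with one entry per compared pair;
# an entry is True where that pairwise difference is significant at alpha.
print("Pairwise decisions (True = significant difference):")
print(tukey_results.reject)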


PRACTICAL 6

Aim: - Regression and its Types.

A. Implement simple linear regression using a dataset.

import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

housing = fetch_california_housing()
housing_df = pd.DataFrame(housing.data, columns=housing.feature_names)
print(housing_df)

housing_df['PRICE'] = housing.target

X = housing_df[['AveRooms']]
y = housing_df['PRICE']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


model = LinearRegression()
model.fit(X_train, y_train)

mse = mean_squared_error(y_test, model.predict(X_test))
r2 = r2_score(y_test, model.predict(X_test))

print("Mean Squared Error:", mse)
print("R-squared:", r2)
print("Intercept:", model.intercept_)
print("Coefficient:", model.coef_)

Output:


B. Extend the analysis to multiple linear regression and assess the impact of additional predictors.

# Multiple Linear Regression
X = housing_df.drop('PRICE', axis=1)
y = housing_df['PRICE']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error:", mse)
print("R-squared:", r2)
print("Intercept:", model.intercept_)
print("Coefficient:", model.coef_)

Output:


Conclusion

The Mean Squared Error (MSE) is a commonly used metric for evaluating regression models. It measures the average squared difference between the predicted values and the actual values of the target variable. A lower MSE indicates that the model's predictions are closer to the actual values on average, suggesting better performance.

R-squared tells us how well the independent variables explain the variability of the dependent variable. It ranges from 0 to 1: 0 indicates that the model explains none of the variability of the dependent variable around its mean, while 1 indicates that the model explains all of it.

The intercept is the point where the regression line crosses the y-axis; it gives the baseline value of the dependent variable when all predictors are zero.

The coefficients represent the impact of changes in the independent variables on the dependent variable in a linear regression model.
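
To make the two metrics concrete, the short sketch below (with hypothetical y_true and y_pred arrays) computes MSE and R-squared directly from their definitions; it reproduces what mean_squared_error and r2_score report:

import numpy as np

y_true = np.array([3.0, 2.5, 4.0, 5.0])   # hypothetical actual target values
y_pred = np.array([2.8, 2.7, 3.6, 5.2])   # hypothetical model predictions

mse = np.mean((y_true - y_pred) ** 2)              # average squared error
ss_res = np.sum((y_true - y_pred) ** 2)            # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # total sum of squares
r2 = 1 - ss_res / ss_tot                           # fraction of variance explained

print("MSE:", mse)
print("R-squared:", r2)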


PRACTICAL 7
Aim: - Logistic Regression and Decision Tree

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, classification_report

# Load the Iris dataset and create a binary classification problem
iris = load_iris()
iris_df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                       columns=iris['feature_names'] + ['target'])
binary_df = iris_df[iris_df['target'] != 2]
X = binary_df.drop('target', axis=1)
y = binary_df['target']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a logistic regression model and evaluate its performance
logistic_model = LogisticRegression()
logistic_model.fit(X_train, y_train)
y_pred_logistic = logistic_model.predict(X_test)


print("Logistic Regression Metrics")


print("Accuracy: ", accuracy_score(y_test,
y_pred_logistic)) print("Precision:",
precision_score(y_test, y_pred_logistic))
print("Recall: ", recall_score(y_test,
y_pred_logistic))

print("\nClassification Report") print(classification_report(y_test, y_pred_logistic))


# Train a decision tree model and evaluate its
performancedecision_tree_model =
DecisionTreeClassifier()
decision_tree_model.fit(X_train, y_train)
y_pred_tree =
decision_tree_model.predict(X_test)
print("\nDecision Tree Metrics")
print("Accuracy: ", accuracy_score(y_test,
y_pred_tree))print("Precision:",
precision_score(y_test, y_pred_tree))
print("Recall: ", recall_score(y_test,
y_pred_tree)) print("\nClassification Report")
print(classification_report(y_test, y_pred_tree))
Output:-


Conclusion:

Precision: Precision measures the ratio of correctly predicted positive observations to the total predicted positives, i.e., the accuracy of the model's positive predictions. Higher precision means fewer false positives; a precision of 1.0 means every positive prediction made by the model is correct.

Recall (also called sensitivity or true positive rate): Recall measures the ratio of correctly predicted positive observations to all actual positives in the dataset, i.e., the model's ability to find every positive instance. Higher recall means fewer false negatives; a recall of 1.0 means the model identifies all positive instances in the dataset.

F1-score: The F1-score is the harmonic mean of precision and recall, ranging from 0 (worst) to 1 (best). A higher F1-score indicates better overall performance, especially when false positives and false negatives both need to be kept low.

Support: Support is the number of actual occurrences of each class in the dataset, i.e., the number of samples per class. Ideally, the support values reflect a well-distributed dataset with enough samples for every class.
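
The short sketch below, with hypothetical true and predicted labels, shows how precision, recall and the F1-score follow from the confusion-matrix counts described above:

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # hypothetical actual labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # hypothetical predicted labels

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print("TP:", tp, "FP:", fp, "FN:", fn, "TN:", tn)
print("Precision:", precision)   # 3 / (3 + 1) = 0.75
print("Recall:", recall)         # 3 / (3 + 1) = 0.75
print("F1-score:", f1)           # 0.75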


PRACTICAL 8
Aim: - K-Means Clustering

import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

data = pd.read_csv("C:\\Users\\Reape\\Downloads\\wholesale\\wholesale.csv")
data.head()

categorical_features = ['Channel', 'Region']
continuous_features = ['Fresh', 'Milk', 'Grocery', 'Frozen', 'Detergents_Paper', 'Delicassen']
data[continuous_features].describe()

for col in categorical_features:
    dummies = pd.get_dummies(data[col], prefix=col)
    data = pd.concat([data, dummies], axis=1)
    data.drop(col, axis=1, inplace=True)
data.head()

mms = MinMaxScaler()
mms.fit(data)
data_transformed = mms.transform(data)


sum_of_squared_distances = []
K = range(1, 15)
for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(data_transformed)
    sum_of_squared_distances.append(km.inertia_)

plt.plot(K, sum_of_squared_distances, 'bx-')
plt.xlabel('k')
plt.ylabel('sum_of_squared_distances')
plt.title('Elbow Method for optimal k')
plt.show()

Output:


Conclusion:

• The elbow method helps in determining the optimal number of clusters for the
dataset. The point where the rate of decrease in the sum of squared distances
significantly slows down suggests a suitable number of clusters.
• We conclude that the optimal number of clusters for the data is 5.
• The optimal number of clusters identified using this method can be used for
further analysis or segmentation of customers based on their purchasing
behavior.
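
As a follow-up, a minimal sketch that fits the final model with the k = 5 suggested by the elbow plot and attaches the resulting segment labels to the customers (it assumes the data and data_transformed objects from the code above):

km_final = KMeans(n_clusters=5, random_state=42)
cluster_labels = km_final.fit_predict(data_transformed)

data['cluster'] = cluster_labels           # label each customer with its segment
print(data['cluster'].value_counts())      # size of each customer segment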


PRACTICAL 9
Aim: - Principal Component Analysis (PCA)

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

iris = load_iris()
iris_df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                       columns=iris['feature_names'] + ['target'])
X = iris_df.drop('target', axis=1)
y = iris_df['target']

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

pca = PCA()
X_pca = pca.fit_transform(X_scaled)
explained_variance_ratio = pca.explained_variance_ratio_


plt.figure(figsize=(8, 6))
plt.plot(np.cumsum(explained_variance_ratio), marker='o', linestyle='--')
plt.title('Explained Variance Ratio')
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Explained Variance Ratio')
plt.grid(True)
plt.show()

cumulative_variance_ratio = np.cumsum(explained_variance_ratio)
n_components = np.argmax(cumulative_variance_ratio >= 0.95) + 1
print(f"Number of principal components to explain 95% variance: {n_components}")

pca = PCA(n_components=n_components)
X_reduced = pca.fit_transform(X_scaled)

plt.figure(figsize=(8, 6))
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y, cmap='viridis', s=50, alpha=0.5)
plt.title('Data in Reduced-dimensional Space')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.colorbar(label='Target')
plt.show()


Output:

Conclusion:
In conclusion, the code demonstrates the effectiveness of PCA in reducing the
dimensionality of high-dimensional datasets while preserving essential information.
It provides a systematic approach to exploring and visualizing complex datasets,
thereby aiding in data analysis and interpretation.
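
To aid interpretation of the retained components, the minimal sketch below (assuming the fitted pca object and the feature DataFrame X from the code above) prints the component loadings, i.e. how strongly each original feature contributes to each principal component:

import pandas as pd

loadings = pd.DataFrame(pca.components_,
                        columns=X.columns,
                        index=[f'PC{i + 1}' for i in range(pca.n_components_)])
print(loadings.round(3))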


PRACTICAL 10
Aim: - Data Visualization and Storytelling

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
# Generate random data
np.random.seed(42) # Set a seed for reproducibility
# Create a DataFrame with random data
data = pd.DataFrame({
'variable1': np.random.normal(0, 1, 1000),
'variable2': np.random.normal(2, 2, 1000) + 0.5 * np.random.normal(0, 1, 1000),
'variable3': np.random.normal(-1, 1.5, 1000),
'category': pd.Series(np.random.choice(['A', 'B', 'C', 'D'], size=1000, p=[0.4, 0.3, 0.2,
0.1]),
dtype='category')
})
# Create a scatter plot to visualize the relationship between two variables
plt.figure(figsize=(10, 6))
plt.scatter(data['variable1'], data['variable2'], alpha=0.5)
plt.title('Relationship between Variable 1 and Variable 2', fontsize=16)
plt.xlabel('Variable 1', fontsize=14)
plt.ylabel('Variable 2', fontsize=14)
plt.show()
# Create a bar chart to visualize the distribution of a categorical variable
plt.figure(figsize=(10, 6))
sns.countplot(x='category', data=data)
plt.title('Distribution of Categories', fontsize=16)
plt.xlabel('Category', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.xticks(rotation=45)
plt.show()
# Create a heatmap to visualize the correlation between numerical variables
plt.figure(figsize=(10, 8))
numerical_cols = ['variable1', 'variable2', 'variable3']
sns.heatmap(data[numerical_cols].corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap', fontsize=16)
plt.show()
# Data Storytelling
print("Title: Exploring the Relationship between Variable 1 and Variable 2")


print("\nThe scatter plot (Figure 1) shows the relationship between Variable 1 and
Variable 2. ")
print("\nScatter Plot")
print("Figure 1: Scatter Plot of Variable 1 and Variable 2")
print("\nTo better understand the distribution of the categorical variable 'category',
we created a ")
print("\nBar Chart")
print("Figure 2: Distribution of Categories")
print("\nAdditionally, we explored the correlation between numerical variables using
a heatmap ")
print("\nHeatmap")
print("Figure 3: Correlation Heatmap")

print("\nIn summary, the visualizations and analysis provide insights into the
relationships ")

Output:



