Data Science Practicals
MUMBAI UNIVERSITY
DATA SCIENCE
LAB MANUAL
CERTIFICATE
DEPARTMENT OF COMPUTER SCIENCE
This is to certify that Mr./Ms. ______________________ of B.Sc. (CS) Semester VI, Roll No. ________, has successfully
completed the practicals in the subject of Data Science as per the requirements of the University of Mumbai, in partial
fulfilment of the Degree of Bachelor of Science (Computer Science). It is also to certify that this is the
original work of the candidate, done during the academic year 2024-2025.
H.O.D
DEPARTMENT OF C.S.
INDEX
Sr. No.   Practical Name                                                        Date   Sign
1. Introduction to Excel
a. Perform conditional formatting on a dataset using various
criteria.
b. Create a pivot table to analyze and summarize data.
8. K-Means Clustering
PRACTICAL 1
Steps
Step 1: Go to Conditional Formatting > Greater Than.
Step 2: Enter the Greater Than threshold value, for example 2000.
Steps
Step 1: Select the entire table and go to the Insert tab > PivotChart.
Step 2: Select "New Worksheet" in the Create PivotChart window.
Steps: -
Step 1: In the Data tab, go to What-If Analysis > Goal Seek.
Step 2: Fill in the Goal Seek dialog (Set cell, To value, By changing cell) accordingly and click OK.
PRACTICAL 2
(1)
# Read data from a CSV file
import pandas as pd

df = pd.read_csv('Student_Marks.csv')
print("Our dataset:")
print(df)
(2)
# Reading data from a JSON file
import pandas as pd

data = pd.read_json('dataset.json')
print(data)
(1)
# Replacing NA values using fillna()
import pandas as pd

df = pd.read_csv('titanic.csv')
print(df)
print(df.head(10))
print("Dataset after filling NA values with 0:")
df2 = df.fillna(value=0)
print(df2)
(2)
# Dropping NA values using dropna()
import pandas as pd

df = pd.read_csv('titanic.csv')
print(df)
print(df.head(10))
# Drop rows containing NA values
df2 = df.dropna()
print("Dataset after dropping NA values:")
print(df2)
C. Manipulate and transform data using functions like filtering, sorting, and
grouping
Code:
import pandas as pd

# Load the Iris dataset
iris = pd.read_csv('Iris.csv')
# Filtering data: keep only the Iris-setosa rows (used as 'setosa' below)
setosa = iris[iris['Species'] == 'Iris-setosa']
print("Filtered iris dataset (Iris-setosa only):")
print(setosa.head())
# Sorting data
sorted_iris = iris.sort_values(by='SepalLengthCm', ascending=False)
print("\nSorted iris dataset:")
print(sorted_iris.head())
# Grouping data
grouped_species = iris.groupby('Species').mean()
print("\nMean measurements for each species:")
print(grouped_species)
PRACTICAL 3
# Standardization and normalization
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.read_csv('wine.csv', header=None, usecols=[0, 1, 2], skiprows=1)
df.columns = ['classlabel', 'Alcohol', 'Malic Acid']
print("Original DataFrame:")
print(df)

scaling = MinMaxScaler()
scaled_value = scaling.fit_transform(df[['Alcohol', 'Malic Acid']])
df[['Alcohol', 'Malic Acid']] = scaled_value
print("\nDataframe after MinMax Scaling")
print(df)

scaling = StandardScaler()
scaled_standardvalue = scaling.fit_transform(df[['Alcohol', 'Malic Acid']])
df[['Alcohol', 'Malic Acid']] = scaled_standardvalue
print("\nDataframe after Standard Scaling")
print(df)
Code:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

iris = pd.read_csv("Iris.csv")
print(iris)

le = LabelEncoder()
iris['code'] = le.fit_transform(iris.Species)
print(iris)
PRACTICAL 4
Code: -
# t-test
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
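# The two samples below are illustrative placeholders (assumed, not part of the
# original listing) so that the t-test and the print statements below can run.
np.random.seed(0)
group1 = np.random.normal(loc=50, scale=10, size=100)
group2 = np.random.normal(loc=55, scale=10, size=100)

# Independent two-sample t-test comparing the means of the two groups
t_statistic, p_value = stats.ttest_ind(group1, group2)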
print("Results of Two-Sample t-
test:")print(f'T-statistic:
{t_statistic}') print(f'P-value:
{p_value}')
Output:
# Chi-square test
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
import warnings
from scipy import stats
19
| Data Science | |T.Y.C. S| |2024-25|
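# Assumed preparation step (not shown in the original listing): load the auto-mpg
# data and bin two numeric columns into the categorical 'horsepower_new' and
# 'modelyear_new' used in the crosstab below. The file name, column names, and
# bin labels here are illustrative.
df = pd.read_csv('auto-mpg.csv')
df['horsepower'] = pd.to_numeric(df['horsepower'], errors='coerce')
df['horsepower_new'] = pd.cut(df['horsepower'], bins=3, labels=['low', 'medium', 'high'])
df['modelyear_new'] = pd.cut(df['model year'], bins=3, labels=['early', 'mid', 'late'])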
df_chi = pd.crosstab(df['horsepower_new'], df['modelyear_new'])
print(df_chi)
print(stats.chi2_contingency(df_chi))
Output:
Conclusion:
There is sufficient evidence to reject the null hypothesis, indicating that
there is a significant association between 'horsepower_new' and
'modelyear_new' categories.
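As a small illustrative check (assuming the crosstab above is stored in df_chi), the p-value returned by chi2_contingency can be compared directly against the 0.05 significance level:

# Unpack the test statistic, p-value, degrees of freedom and expected frequencies
chi2, p, dof, expected = stats.chi2_contingency(df_chi)
if p < 0.05:
    print("Reject the null hypothesis: the two variables are associated.")
else:
    print("Fail to reject the null hypothesis: no significant association.")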
PRACTICAL 5
Aim: - ANOVA (Analysis of Variance)
A. Perform one-way ANOVA to compare means across multiple groups.
B. Conduct post-hoc tests to identify significant differences between group means.
import pandas as pd
import scipy.stats as stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd
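The aim calls for a one-way ANOVA followed by a Tukey HSD post-hoc test. A minimal sketch using the imports above, with illustrative group data assumed in place of the original dataset, is:

# Illustrative data: three groups of scores (assumed; the original dataset is not shown)
df = pd.DataFrame({
    'group': ['A'] * 10 + ['B'] * 10 + ['C'] * 10,
    'score': list(range(10, 20)) + list(range(15, 25)) + list(range(22, 32)),
})

# A. One-way ANOVA: do the group means differ?
f_stat, p_value = stats.f_oneway(
    df.loc[df['group'] == 'A', 'score'],
    df.loc[df['group'] == 'B', 'score'],
    df.loc[df['group'] == 'C', 'score'],
)
print(f"F-statistic: {f_stat}")
print(f"P-value: {p_value}")

# B. Tukey HSD post-hoc test: which pairs of group means differ?
tukey = pairwise_tukeyhsd(endog=df['score'], groups=df['group'], alpha=0.05)
print(tukey)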
Output:-
Conclusion
• F-statistic: This value indicates the ratio of the variance between groups to the
variance within groups. A larger F-statistic suggests that the means of the
groups are more different from each other compared to the variability within
each group.
• If the p-value is less than the chosen significance level (e.g., 0.05), it suggests
that there are significant differences among the group means.
• If the p-value is greater than the significance level, it suggests that there is
insufficient evidence to reject the null hypothesis, meaning there are no
significant differences among the group means.
• There are significant differences among the means of the groups.
PRACTICAL 6
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

housing = fetch_california_housing()
housing_df = pd.DataFrame(housing.data, columns=housing.feature_names)
print(housing_df)
housing_df['PRICE'] = housing.target

X = housing_df[['AveRooms']]
y = housing_df['PRICE']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
mse = mean_squared_error(y_test, model.predict(X_test))
r2 = r2_score(y_test, model.predict(X_test))
print(f"Mean Squared Error: {mse}")
print(f"R2 score: {r2}")
Output:
B. Extend the analysis to multiple linear regression and assess the impact of
additional predictors.
X = housing_df.drop('PRICE', axis=1)
y = housing_df['PRICE']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
print(f"R2 score: {r2}")
print(f"Intercept: {model.intercept_}")
print(f"Coefficients: {model.coef_}")
Output:
Conclusion
The Mean Squared Error (MSE) is a commonly used metric to evaluate the
performance of regression models. It measures the average squared difference
between the predicted values and the actual values of the target variable. A lower
MSE value indicates that the model's predictions are closer to the actual values on
average, suggesting better performance.
R2 tells us how well the independent variables explain the variability of the
dependent variable. It ranges from 0 to 1, where 0 indicates that the model does not
explain any of the variability of the dependent variable around its mean, and 1
indicates that the model explains all of that variability.
The intercept represents the point where the regression line intersects the y-axis on
a graph. It provides information about the baseline value of the dependent variable
when all predictors are zero.
Coefficients represent the impact of changes in the independent variables on the
dependent variable in a linear regression model.
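As a small illustration of these definitions (with made-up numbers, not taken from the model output above), MSE and R2 can be computed directly from a set of predictions:

import numpy as np

# Illustrative actual and predicted values (assumed for demonstration only)
y_true = np.array([3.0, 2.5, 4.0, 5.5])
y_pred = np.array([2.8, 2.7, 4.2, 5.0])

# MSE: average of the squared differences between actual and predicted values
mse = np.mean((y_true - y_pred) ** 2)

# R2: 1 minus the ratio of residual variance to total variance around the mean
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MSE: {mse}")
print(f"R2: {r2}")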
PRACTICAL 7
Aim: - Logistic Regression and Decision Tree
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, classification_report

# Load the Iris dataset and create a binary classification problem
iris = load_iris()
iris_df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                       columns=iris['feature_names'] + ['target'])
binary_df = iris_df[iris_df['target'] != 2]
X = binary_df.drop('target', axis=1)
y = binary_df['target']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a logistic regression model and evaluate its performance
logistic_model = LogisticRegression()
logistic_model.fit(X_train, y_train)
y_pred_logistic = logistic_model.predict(X_test)
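A minimal sketch of the evaluation and decision-tree steps, assuming the same train/test split and the metrics imported above:

# Evaluate the logistic regression model
print("Logistic Regression accuracy:", accuracy_score(y_test, y_pred_logistic))
print(classification_report(y_test, y_pred_logistic))

# Train a decision tree classifier on the same split and evaluate it
tree_model = DecisionTreeClassifier(random_state=42)
tree_model.fit(X_train, y_train)
y_pred_tree = tree_model.predict(X_test)
print("Decision Tree accuracy:", accuracy_score(y_test, y_pred_tree))
print(classification_report(y_test, y_pred_tree))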
Conclusion:
Precision: Precision measures the ratio of correctly predicted positive observations to the total
predicted positives. It indicates the accuracy of positive predictions made by the model. A higher precision indicates
fewer false positives. Ideally, precision should be as close to 1.0 as possible. A precision of 1.0 indicates that all
positive predictions made by the model are correct, with no false positives.
Recall (also called Sensitivity or True Positive Rate): Recall measures the
ratio of correctly predicted positive observations to all actual positives in the dataset.
It indicates the model's ability to identify all positive instances correctly. A higher
recall indicates fewer false negatives. An ideal recall score would also be 1.0: a recall
of 1.0 indicates that the model correctly identifies all positive instances in the dataset,
with no false negatives.
F1-score: The F1-score is the harmonic mean of precision and recall, providing
a balance between the two. The F1-score reaches its best value at 1 and worst at 0. A
higher F1-score indicates better overall performance, especially in scenarios where
we want to balance false positives and false negatives.
Support: Support is the number of actual occurrences of the class in the
specified dataset. It represents the number of samples in each class. The support
value for each class ideally reflects a well-distributed dataset, with enough samples
for each class.
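As a quick numeric illustration of these definitions (with assumed counts, not the values from the report above):

# Assumed confusion-matrix counts for one class, for illustration only
tp, fp, fn = 8, 2, 1

precision = tp / (tp + fp)                           # correctness of positive predictions
recall = tp / (tp + fn)                              # coverage of actual positives
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of precision and recall

print(f"Precision: {precision:.2f}, Recall: {recall:.2f}, F1-score: {f1:.2f}")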
PRACTICAL 8
Aim: - K-Means Clustering
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

data = pd.read_csv(r"C:\Users\Reape\Downloads\wholesale\wholesale.csv")
print(data.head())

mms = MinMaxScaler()
mms.fit(data)
data_transformed = mms.transform(data)
sum_of_squared_distances = []
K = range(1, 15)
for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(data_transformed)
    sum_of_squared_distances.append(km.inertia_)

plt.plot(K, sum_of_squared_distances, 'bx-')
plt.xlabel('k')
plt.ylabel('sum_of_squared_distances')
plt.title('Elbow Method for optimal k')
plt.show()
Output:
Conclusion:
• The elbow method helps in determining the optimal number of clusters for the
dataset. The point where the rate of decrease in the sum of squared distances
significantly slows down suggests a suitable number of clusters.
• We conclude that the optimal number of clusters for the data is 5.
• The optimal number of clusters identified using this method can be used for
further analysis or segmentation of customers based on their purchasing
behavior.
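As a follow-up sketch (assuming the data_transformed array from above and the chosen k = 5), the final clustering could be applied like this:

# Fit K-Means with the chosen number of clusters and attach the labels to the data
km_final = KMeans(n_clusters=5, random_state=42)
data['cluster'] = km_final.fit_predict(data_transformed)
print(data['cluster'].value_counts())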
PRACTICAL 9
Aim: - Principal Component Analysis (PCA)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

iris = load_iris()
iris_df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                       columns=iris['feature_names'] + ['target'])
X = iris_df.drop('target', axis=1)
y = iris_df['target']

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

pca = PCA()
X_pca = pca.fit_transform(X_scaled)
explained_variance_ratio = pca.explained_variance_ratio_
plt.figure(figsize=(8, 6))
plt.plot(np.cumsum(explained_variance_ratio), marker='o', linestyle='--')
plt.title('Explained Variance Ratio')
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Explained Variance Ratio')
plt.grid(True)
plt.show()

cumulative_variance_ratio = np.cumsum(explained_variance_ratio)
n_components = np.argmax(cumulative_variance_ratio >= 0.95) + 1
print(f"Number of principal components to explain 95% variance: {n_components}")

pca = PCA(n_components=n_components)
X_reduced = pca.fit_transform(X_scaled)

plt.figure(figsize=(8, 6))
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y, cmap='viridis', s=50, alpha=0.5)
plt.title('Data in Reduced-dimensional Space')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.colorbar(label='Target')
plt.show()
Output:
Conclusion:
In conclusion, the code demonstrates the effectiveness of PCA in reducing the
dimensionality of high-dimensional datasets while preserving essential information.
It provides a systematic approach to exploring and visualizing complex datasets,
thereby aiding in data analysis and interpretation.
PRACTICAL 10
Aim: - Data Visualization and Storytelling
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
# Generate random data
np.random.seed(42) # Set a seed for reproducibility
# Create a DataFrame with random data
data = pd.DataFrame({
'variable1': np.random.normal(0, 1, 1000),
'variable2': np.random.normal(2, 2, 1000) + 0.5 * np.random.normal(0, 1, 1000),
'variable3': np.random.normal(-1, 1.5, 1000),
'category': pd.Series(np.random.choice(['A', 'B', 'C', 'D'], size=1000, p=[0.4, 0.3, 0.2,
0.1]),
dtype='category')
})
# Create a scatter plot to visualize the relationship between two variables
plt.figure(figsize=(10, 6))
plt.scatter(data['variable1'], data['variable2'], alpha=0.5)
plt.title('Relationship between Variable 1 and Variable 2', fontsize=16)
plt.xlabel('Variable 1', fontsize=14)
plt.ylabel('Variable 2', fontsize=14)
plt.show()
# Create a bar chart to visualize the distribution of a categorical variable
plt.figure(figsize=(10, 6))
sns.countplot(x='category', data=data)
plt.title('Distribution of Categories', fontsize=16)
plt.xlabel('Category', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.xticks(rotation=45)
plt.show()
# Create a heatmap to visualize the correlation between numerical variables
plt.figure(figsize=(10, 8))
numerical_cols = ['variable1', 'variable2', 'variable3']
sns.heatmap(data[numerical_cols].corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap', fontsize=16)
plt.show()
# Data Storytelling
print("Title: Exploring the Relationship between Variable 1 and Variable 2")
print("\nThe scatter plot (Figure 1) shows the relationship between Variable 1 and
Variable 2. ")
print("\nScatter Plot")
print("Figure 1: Scatter Plot of Variable 1 and Variable 2")
print("\nTo better understand the distribution of the categorical variable 'category',
we created a ")
print("\nBar Chart")
print("Figure 2: Distribution of Categories")
print("\nAdditionally, we explored the correlation between numerical variables using
a heatmap ")
print("\nHeatmap")
print("Figure 3: Correlation Heatmap")
print("\nIn summary, the visualizations and analysis provide insights into the
relationships ")
Output: