Data Science Practicals
MUMBAI UNIVERSITY
DATA SCIENCE
LAB MANUAL
CERTIFICATE
DEPARTMENT OF COMPUTER SCIENCE
This is to certify that Mr./Ms. ______________________ of B.Sc. (CS) Semester VI, Roll No. ________, has successfully
completed the practicals in the subject of Data Science as per the requirements of the University of Mumbai, in partial
fulfilment of the Degree of Bachelor of Science (Computer Science). It is also to certify that this is the
original work of the candidate, done during the academic year 2024-2025.
H.O.D
DEPARTMENT OF C.S.
INDEX
Sr. No.   Practical Name                                                        Date   Sign
1. Introduction to Excel
a. Perform conditional formatting on a dataset using various
criteria.
b. Create a pivot table to analyze and summarize data.
8. K-Means Clustering
PRACTICAL 1
Steps
Step 1: Go to Conditional Formatting > Greater Than.
Step 2: Enter the Greater Than threshold value, for example 2000.
Steps
Step 1: Select the entire table and go to the Insert tab > PivotChart.
Step 2: Select "New Worksheet" in the Create PivotChart window.
Steps: -
Step 1: In the Data tab, go to What-If Analysis > Goal Seek.
Step 2: Fill in the Goal Seek dialog (Set cell, To value, By changing cell) accordingly and click OK.
PRACTICAL 2
(1)
# Read data from a CSV file
import pandas as pd

df = pd.read_csv('Student_Marks.csv')
print("Our dataset:")
print(df)
(2)
# Reading data from a JSON file
import pandas as pd

data = pd.read_json('dataset.json')
print(data)
(1)
# Replacing NA values using fillna()
import pandas as pd

df = pd.read_csv('titanic.csv')
print(df)
print(df.head(10))
print("Dataset after filling NA values with 0:")
df2 = df.fillna(value=0)
print(df2)
(2)
# Dropping NA values using dropna()
import pandas as pd

df = pd.read_csv('titanic.csv')
print(df)
print(df.head(10))
# Drop rows containing NA values
df2 = df.dropna()
print("Dataset after dropping NA values:")
print(df2)
C. Manipulate and transform data using functions like filtering, sorting, and
grouping
Code:
import pandas as pd

# Load the Iris dataset
iris = pd.read_csv('Iris.csv')
# Filtering data: keep only the Iris-setosa rows (used as 'setosa' below)
setosa = iris[iris['Species'] == 'Iris-setosa']
print("Filtered iris dataset (Iris-setosa only):")
print(setosa.head())
# Sorting data
sorted_iris = iris.sort_values(by='SepalLengthCm', ascending=False)
print("\nSorted iris dataset:")
print(sorted_iris.head())
# Grouping data
grouped_species = iris.groupby('Species').mean()
print("\nMean measurements for each species:")
print(grouped_species)
PRACTICAL 3
# Standardization and normalization
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.read_csv('wine.csv', header=None, usecols=[0, 1, 2], skiprows=1)
df.columns = ['classlabel', 'Alcohol', 'Malic Acid']
print("Original DataFrame:")
print(df)

scaling = MinMaxScaler()
scaled_value = scaling.fit_transform(df[['Alcohol', 'Malic Acid']])
df[['Alcohol', 'Malic Acid']] = scaled_value
print("\nDataframe after MinMax Scaling")
print(df)

scaling = StandardScaler()
scaled_standardvalue = scaling.fit_transform(df[['Alcohol', 'Malic Acid']])
df[['Alcohol', 'Malic Acid']] = scaled_standardvalue
print("\nDataframe after Standard Scaling")
print(df)
Code:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

iris = pd.read_csv("Iris.csv")
print(iris)

le = LabelEncoder()
iris['code'] = le.fit_transform(iris.Species)
print(iris)
PRACTICAL 4
Code: -
# t-test
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
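# The two samples below are illustrative placeholders (assumed, not part of the
# original listing) so that the t-test and the print statements below can run.
np.random.seed(0)
group1 = np.random.normal(loc=50, scale=10, size=100)
group2 = np.random.normal(loc=55, scale=10, size=100)

# Independent two-sample t-test comparing the means of the two groups
t_statistic, p_value = stats.ttest_ind(group1, group2)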
print("Results of Two-Sample t-
test:")print(f'T-statistic:
{t_statistic}') print(f'P-value:
{p_value}')
Output:
# Chi-square test
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
import warnings
from scipy import stats
19
| Data Science | |T.Y.C. S| |2024-25|
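# Assumed preparation step (not shown in the original listing): load the auto-mpg
# data and bin two numeric columns into the categorical 'horsepower_new' and
# 'modelyear_new' used in the crosstab below. The file name, column names, and
# bin labels here are illustrative.
df = pd.read_csv('auto-mpg.csv')
df['horsepower'] = pd.to_numeric(df['horsepower'], errors='coerce')
df['horsepower_new'] = pd.cut(df['horsepower'], bins=3, labels=['low', 'medium', 'high'])
df['modelyear_new'] = pd.cut(df['model year'], bins=3, labels=['early', 'mid', 'late'])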
df_chi = pd.crosstab(df['horsepower_new'], df['modelyear_new'])
print(df_chi)
print(stats.chi2_contingency(df_chi))
Output:
Conclusion:
There is sufficient evidence to reject the null hypothesis, indicating that
there is a significant association between 'horsepower_new' and
'modelyear_new' categories.
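As a small illustrative check (assuming the crosstab above is stored in df_chi), the p-value returned by chi2_contingency can be compared directly against the 0.05 significance level:

# Unpack the test statistic, p-value, degrees of freedom and expected frequencies
chi2, p, dof, expected = stats.chi2_contingency(df_chi)
if p < 0.05:
    print("Reject the null hypothesis: the two variables are associated.")
else:
    print("Fail to reject the null hypothesis: no significant association.")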
PRACTICAL 5
Aim: - ANOVA (Analysis of Variance)
A. Perform one-way ANOVA to compare means across multiple groups.
B. Conduct post-hoc tests to identify significant differences between group means.
import pandas as pd
import scipy.stats as stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd
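The aim calls for a one-way ANOVA followed by a Tukey HSD post-hoc test. A minimal sketch using the imports above, with illustrative group data assumed in place of the original dataset, is:

# Illustrative data: three groups of scores (assumed; the original dataset is not shown)
df = pd.DataFrame({
    'group': ['A'] * 10 + ['B'] * 10 + ['C'] * 10,
    'score': list(range(10, 20)) + list(range(15, 25)) + list(range(22, 32)),
})

# A. One-way ANOVA: do the group means differ?
f_stat, p_value = stats.f_oneway(
    df.loc[df['group'] == 'A', 'score'],
    df.loc[df['group'] == 'B', 'score'],
    df.loc[df['group'] == 'C', 'score'],
)
print(f"F-statistic: {f_stat}")
print(f"P-value: {p_value}")

# B. Tukey HSD post-hoc test: which pairs of group means differ?
tukey = pairwise_tukeyhsd(endog=df['score'], groups=df['group'], alpha=0.05)
print(tukey)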
Output:-
Conclusion
• F-statistic: This value indicates the ratio of the variance between groups to the
variance within groups. A larger F-statistic suggests that the means of the
groups are more different from each other compared to the variability within
each group.
• If the p-value is less than the chosen significance level (e.g., 0.05), it suggests
that there are significant differences among the group means.
• If the p-value is greater than the significance level, it suggests that there is
insufficient evidence to reject the null hypothesis, meaning there are no
significant differences among the group means.
• There are significant differences among the means of the groups.
PRACTICAL 6
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

housing = fetch_california_housing()
housing_df = pd.DataFrame(housing.data, columns=housing.feature_names)
print(housing_df)
housing_df['PRICE'] = housing.target

X = housing_df[['AveRooms']]
y = housing_df['PRICE']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
mse = mean_squared_error(y_test, model.predict(X_test))
r2 = r2_score(y_test, model.predict(X_test))
print(f"Mean Squared Error: {mse}")
print(f"R2 score: {r2}")
Output:
B. Extend the analysis to multiple linear regression and assess the impact of
additional predictors.
X = housing_df.drop('PRICE', axis=1)
y = housing_df['PRICE']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
print(f"R2 score: {r2}")
print(f"Intercept: {model.intercept_}")
print(f"Coefficients: {model.coef_}")
Output:
Conclusion
The Mean Squared Error (MSE) is a commonly used metric to evaluate the
performance of regression models. It measures the average squared difference
between the predicted values and the actual values of the target variable. A lower
MSE value indicates that the model's predictions are closer to the actual values on
average, suggesting better performance.
R2 tells us how well the independent variables explain the variability of the
dependent variable. It ranges from 0 to 1, where 0 indicates that the model does not
explain any of the variability of the dependent variable around its mean, and 1
indicates that the model explains all of that variability.
The intercept represents the point where the regression line intersects the y-axis on
a graph. It provides information about the baseline value of the dependent variable
when all predictors are zero.
Coefficients represent the impact of changes in the independent variables on the
dependent variable in a linear regression model.
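As a small illustration of these definitions (with made-up numbers, not taken from the model output above), MSE and R2 can be computed directly from a set of predictions:

import numpy as np

# Illustrative actual and predicted values (assumed for demonstration only)
y_true = np.array([3.0, 2.5, 4.0, 5.5])
y_pred = np.array([2.8, 2.7, 4.2, 5.0])

# MSE: average of the squared differences between actual and predicted values
mse = np.mean((y_true - y_pred) ** 2)

# R2: 1 minus the ratio of residual variance to total variance around the mean
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MSE: {mse}")
print(f"R2: {r2}")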
PRACTICAL 7
Aim: - Logistic Regression and Decision Tree
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, classification_report

# Load the Iris dataset and create a binary classification problem
iris = load_iris()
iris_df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                       columns=iris['feature_names'] + ['target'])
binary_df = iris_df[iris_df['target'] != 2]
X = binary_df.drop('target', axis=1)
y = binary_df['target']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a logistic regression model and evaluate its performance
logistic_model = LogisticRegression()
logistic_model.fit(X_train, y_train)
y_pred_logistic = logistic_model.predict(X_test)
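A minimal sketch of the evaluation and decision-tree steps, assuming the same train/test split and the metrics imported above:

# Evaluate the logistic regression model
print("Logistic Regression accuracy:", accuracy_score(y_test, y_pred_logistic))
print(classification_report(y_test, y_pred_logistic))

# Train a decision tree classifier on the same split and evaluate it
tree_model = DecisionTreeClassifier(random_state=42)
tree_model.fit(X_train, y_train)
y_pred_tree = tree_model.predict(X_test)
print("Decision Tree accuracy:", accuracy_score(y_test, y_pred_tree))
print(classification_report(y_test, y_pred_tree))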
Conclusion:
Precision: Precision measures the ratio of correctly predicted positive observations to the total
predicted positives. It indicates the accuracy of positive predictions made by the model. A higher precision indicates
fewer false positives. Ideally, precision should be as close to 1.0 as possible. A precision of 1.0 indicates that all
positive predictions made by the model are correct, with no false positives.
Recall (also called Sensitivity or True Positive Rate): Recall measures the
ratio of correctly predicted positive observations to all actual positives in the dataset.
It indicates the model's ability to identify all positive instances correctly. A higher
recall indicates fewer false negatives. An ideal recall score would also be 1.0: a recall
of 1.0 indicates that the model correctly identifies all positive instances in the dataset,
with no false negatives.
F1-score: The F1-score is the harmonic mean of precision and recall, providing
a balance between the two. The F1-score reaches its best value at 1 and worst at 0. A
higher F1-score indicates better overall performance, especially in scenarios where
we want to balance false positives and false negatives.
Support: Support is the number of actual occurrences of the class in the
specified dataset. It represents the number of samples in each class. The support
value for each class ideally reflects a well-distributed dataset, with enough samples
for each class.
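As a quick numeric illustration of these definitions (with assumed counts, not the values from the report above):

# Assumed confusion-matrix counts for one class, for illustration only
tp, fp, fn = 8, 2, 1

precision = tp / (tp + fp)                           # correctness of positive predictions
recall = tp / (tp + fn)                              # coverage of actual positives
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of precision and recall

print(f"Precision: {precision:.2f}, Recall: {recall:.2f}, F1-score: {f1:.2f}")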
PRACTICAL 8
Aim: - K-Means Clustering
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

data = pd.read_csv(r"C:\Users\Reape\Downloads\wholesale\wholesale.csv")
print(data.head())

mms = MinMaxScaler()
mms.fit(data)
data_transformed = mms.transform(data)
sum_of_squared_distances = []
K = range(1, 15)
for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(data_transformed)
    sum_of_squared_distances.append(km.inertia_)

plt.plot(K, sum_of_squared_distances, 'bx-')
plt.xlabel('k')
plt.ylabel('sum_of_squared_distances')
plt.title('Elbow Method for optimal k')
plt.show()
Output:
Conclusion:
• The elbow method helps in determining the optimal number of clusters for the
dataset. The point where the rate of decrease in the sum of squared distances
significantly slows down suggests a suitable number of clusters.
• We conclude that the optimal number of clusters for the data is 5.
• The optimal number of clusters identified using this method can be used for
further analysis or segmentation of customers based on their purchasing
behavior.
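As a follow-up sketch (assuming the data_transformed array from above and the chosen k = 5), the final clustering could be applied like this:

# Fit K-Means with the chosen number of clusters and attach the labels to the data
km_final = KMeans(n_clusters=5, random_state=42)
data['cluster'] = km_final.fit_predict(data_transformed)
print(data['cluster'].value_counts())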
PRACTICAL 9
Aim: - Principal Component Analysis (PCA)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

iris = load_iris()
iris_df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                       columns=iris['feature_names'] + ['target'])
X = iris_df.drop('target', axis=1)
y = iris_df['target']

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

pca = PCA()
X_pca = pca.fit_transform(X_scaled)
explained_variance_ratio = pca.explained_variance_ratio_
plt.figure(figsize=(8, 6))
plt.plot(np.cumsum(explained_variance_ratio), marker='o', linestyle='--')
plt.title('Explained Variance Ratio')
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Explained Variance Ratio')
plt.grid(True)
plt.show()

cumulative_variance_ratio = np.cumsum(explained_variance_ratio)
n_components = np.argmax(cumulative_variance_ratio >= 0.95) + 1
print(f"Number of principal components to explain 95% variance: {n_components}")

pca = PCA(n_components=n_components)
X_reduced = pca.fit_transform(X_scaled)

plt.figure(figsize=(8, 6))
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y, cmap='viridis', s=50, alpha=0.5)
plt.title('Data in Reduced-dimensional Space')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.colorbar(label='Target')
plt.show()
Output:
Conclusion:
In conclusion, the code demonstrates the effectiveness of PCA in reducing the
dimensionality of high-dimensional datasets while preserving essential information.
It provides a systematic approach to exploring and visualizing complex datasets,
thereby aiding in data analysis and interpretation.
PRACTICAL 10
Aim: - Data Visualization and Storytelling
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
# Generate random data
np.random.seed(42) # Set a seed for reproducibility
# Create a DataFrame with random data
data = pd.DataFrame({
'variable1': np.random.normal(0, 1, 1000),
'variable2': np.random.normal(2, 2, 1000) + 0.5 * np.random.normal(0, 1, 1000),
'variable3': np.random.normal(-1, 1.5, 1000),
'category': pd.Series(np.random.choice(['A', 'B', 'C', 'D'], size=1000, p=[0.4, 0.3, 0.2,
0.1]),
dtype='category')
})
# Create a scatter plot to visualize the relationship between two variables
plt.figure(figsize=(10, 6))
plt.scatter(data['variable1'], data['variable2'], alpha=0.5)
plt.title('Relationship between Variable 1 and Variable 2', fontsize=16)
plt.xlabel('Variable 1', fontsize=14)
plt.ylabel('Variable 2', fontsize=14)
plt.show()
# Create a bar chart to visualize the distribution of a categorical variable
plt.figure(figsize=(10, 6))
sns.countplot(x='category', data=data)
plt.title('Distribution of Categories', fontsize=16)
plt.xlabel('Category', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.xticks(rotation=45)
plt.show()
# Create a heatmap to visualize the correlation between numerical variables
plt.figure(figsize=(10, 8))
numerical_cols = ['variable1', 'variable2', 'variable3']
sns.heatmap(data[numerical_cols].corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap', fontsize=16)
plt.show()
# Data Storytelling
print("Title: Exploring the Relationship between Variable 1 and Variable 2")
print("\nThe scatter plot (Figure 1) shows the relationship between Variable 1 and
Variable 2. ")
print("\nScatter Plot")
print("Figure 1: Scatter Plot of Variable 1 and Variable 2")
print("\nTo better understand the distribution of the categorical variable 'category',
we created a ")
print("\nBar Chart")
print("Figure 2: Distribution of Categories")
print("\nAdditionally, we explored the correlation between numerical variables using
a heatmap ")
print("\nHeatmap")
print("Figure 3: Correlation Heatmap")
print("\nIn summary, the visualizations and analysis provide insights into the
relationships ")
Output: