DS Practical (BSC CS)

Data science practicals

DATA SCIENCE

T.Y.B.Sc.CS (COMPUTER SCIENCE)

BY

Name: MALI KRISHNA VINOD

Seat No: 1110192

N.B. MEHTA (VALWADA) SCIENCE COLLEGE BORDI


MAHARASHTRA – 401701

DEPARTMENT OF INFORMATION TECHNOLOGY

T.Y.B.Sc.CS (COMPUTER SCIENCE)


Semester-6

Academic year 2023-24


CERTIFICATE

Class: B.Sc. Computer Science (Semester 6)


Year: 2023-2024

This is to certify that the work entered in this journal is the work of Shri MALI
KRISHNA VINOD of T.Y.B.Sc.CS, division Computer Science, Roll No. ____, Uni. Exam No. ____, who has
satisfactorily completed the required number of practicals and worked for the 2nd term of the
year 2023-24 in the college laboratory as laid down by the university.

______________________    _____________________    ____________________
Head of the Department    External Examiner        Internal Examiner / Subject teacher

Date: / / 2024 Department of IT-CS


T.Y.B.Sc.(Computer Science) DATA SCIENCE

INDEX

Sr. No.    Title    Date of Experiment    Date of Submission    Sign

1. Introduction to Excel.

2. Data Frames and Basic Data Pre-processing.

3. Feature Scaling and Dummification.

4. Hypothesis Testing.

5. ANOVA (Analysis of Variance).

6. Regression and Its Types.

7. Logistic Regression and Decision Tree.

8. K-Means Clustering.

9. Principal Component Analysis (PCA).

N.B. Mehta College Page |1


Practical No. 01

Aim: - Introduction to Excel.

• Perform conditional formatting on a dataset using various criteria.

We will take the following marksheet data set to perform Conditional Formatting.

1. Highlight Cells Rules.

Step 1: Select the ‘Percentage’ column in your Excel sheet, click ‘Conditional
Formatting’, then ‘Highlight Cells Rules’, and then ‘Greater Than’.

Step 2: A dialog box will appear. Enter the threshold value, select a formatting type, and
click OK.

Excel then highlights every cell whose value is greater than the threshold, using the
formatting type we selected.

2. Top/Bottom Rules.

Step 1: Select the ‘Total’ column in your Excel sheet, click ‘Conditional Formatting’,
then ‘Top/Bottom Rules’, and then ‘Top 10 %’.

Step 2: A dialog box will appear. Enter the percentage of top values you want to
highlight (here, 20), select a formatting type, and click OK.

Excel then highlights the cells that fall in the top 20 percent of the values in the
column, using the formatting type we selected.

3. Data Bars.

Step 1: Select the column ‘Sub5’ in your Excel sheet and click on ‘Conditional Formatting’
then click on ‘Data Bars’ and then select whichever formatting you like.

This gives us the formatting of Data Bars according to the data available as shown above.

4. Color Scales.

Step 1: Select the column ‘Sub4’ in your Excel sheet and click on ‘Conditional Formatting’
then click on ‘Color Scales’ and then select whichever formatting you like.

This applies a color gradient to a range of cells as shown above.

5. Icon Sets.

Step 1: Select the ‘Sub3’ column in your Excel sheet, click ‘Conditional Formatting’,
then ‘Icon Sets’, and select whichever icon set you like.

In the above figure, the marks of Sub3 are rated using a 4-bar ratings icon set.

• Create a pivot table to analyze and summarize data.

To create a pivot table, we are going to take the following data set of product orders.

Step 1: To create a pivot table, select the rows and columns of the data, go to the
‘Insert’ tab, and click ‘PivotTable’ in the Tables group.

Step 2: A dialog box will appear to create pivot table. Just click on OK.

This will create a blank pivot table along with the "PivotTable Field List" pane.

Let’s do an analysis of ranking the top 5 products by revenue.

Step 1: Drag the "Product" field to the Rows area, and the "Price" field to the Values area.
This will give you the total revenue generated by each product.

Step 2: Right-click on any cell in the ‘Price’ column within the pivot table. Select ‘Show
Values As’ from the context menu. From the dropdown menu, select ‘Rank Largest to
Smallest’.

Step 3: A dialog box will appear for selecting the base field. ‘Product’ is the default;
keep it as it is and click ‘OK’.

As we can see, our data is now ranked by revenue.

Step 4: To see the top 5 ranked products, click the drop-down arrow next to ‘Row
Labels’ in the pivot table and select ‘Value Filters’ from the drop-down menu. Choose
‘Top 10...’ and enter ‘5’ as the number of items, since we want to display the top 5
selling products. Click ‘OK’.

The top 5 products are shown below.

To see the sum of price for these products again, right-click any cell of the
‘Sum of Price’ column, select ‘Show Values As’, and click ‘No Calculation’.

As we can see, we now get the sum of price for each of the top 5 products.

• Use the VLOOKUP function to retrieve information from a different worksheet or table.

The data set we are going to use for the VLOOKUP function is shown below; it contains
some statistics of football players.

To set up the VLOOKUP, we first copy the column headers as follows.

We enter any one of the available IDs. Then we click the cell under the ‘Player’
column and click Insert Function.

Select the ‘VLOOKUP’ function and click on ‘OK’.

The Function Arguments dialog box has 4 fields to fill in. In ‘Lookup_value’, select the
cell in which you enter the ID of the player you wish to look up. In ‘Table_array’, select
the complete table of players. In ‘Col_index_num’, enter the index of the column to
return; the index of the ‘Player’ column is 2. Set ‘Range_lookup’ to FALSE to fetch an
exact match. Then click ‘OK’.

Similarly do this for all the other columns as well.

As we can see we can fetch data of the players from their ID with the help of VLOOKUP.

• Perform what-if analysis using Goal Seek to determine input values for a desired output.

We are going to perform what-if analysis on the following data-set.

Select the cell on which you want to perform Goal Seek. Go to the Data tab, click
What-If Analysis in the Forecast section, and click Goal Seek.

Enter the value you want to achieve, select the cell whose value should be changed to
achieve it, and click OK.

You can see the status of the goal reached. Click on Ok.

As you can see the changes are made and the goal is reached.

Practical No. 02

Aim: - Data Frames and Basic Data Pre-processing.

• Read data from CSV and JSON files into a data frame.

1. Reading Data from a CSV file.

To read data from a CSV file, we need a CSV file. We are going to use the following
‘employee_data.csv’ file and read its data.

We are using Google Colab to read the above CSV file, so we first upload
‘employee_data.csv’ to Google Colab.

Python Code for Reading the ‘employee_data.csv’ File.
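
One way this step might look, as a minimal sketch and not the journal's own listing. It assumes pandas and uses a small in-memory stand-in for ‘employee_data.csv’; the column names and values are hypothetical.

```python
import io

import pandas as pd

# Hypothetical stand-in for 'employee_data.csv' (columns and rows are assumptions).
csv_text = """emp_id,name,department,salary
1,Asha,IT,52000
2,Ravi,HR,48000
3,Meena,IT,61000
"""

# pd.read_csv accepts a file path or any file-like object.
df_csv = pd.read_csv(io.StringIO(csv_text))

print(df_csv.head())   # first rows of the data frame
print(df_csv.dtypes)   # column types inferred by pandas
```

In Colab, after uploading the file, the same call would take the file name instead of the in-memory buffer, for example `pd.read_csv('employee_data.csv')`.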

2. Reading Data from a JSON file.

To read data from a JSON file, we need a JSON file. We are going to use the following
‘employee_data.json’ file and read its data.

We are using Google Colab to read the above JSON file, so we first upload
‘employee_data.json’ to Google Colab.

Python Code for Reading the ‘employee_data.json’ File.
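
A comparable sketch for JSON, again not the journal's own listing. It assumes pandas and a hypothetical list-of-records layout for ‘employee_data.json’; the real file may be structured differently.

```python
import io

import pandas as pd

# Hypothetical stand-in for 'employee_data.json': a JSON array of records.
json_text = """[
  {"emp_id": 1, "name": "Asha", "department": "IT", "salary": 52000},
  {"emp_id": 2, "name": "Ravi", "department": "HR", "salary": 48000},
  {"emp_id": 3, "name": "Meena", "department": "IT", "salary": 61000}
]"""

# pd.read_json accepts a path or a file-like object; each record becomes a row.
df_json = pd.read_json(io.StringIO(json_text))

print(df_json.head())
print(df_json.shape)
```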

• Perform basic data pre-processing tasks such as handling missing values and outliers.
• Manipulate and transform data using functions like filtering, sorting, and grouping.

We use the following CSV and JSON files to perform the above tasks.

CSV file: -

JSON file: -

We are using Google Colab to perform the above tasks, so we first upload the
‘sales_data.csv’ and ‘sales_data.json’ files to Google Colab.

Python Code for the above tasks: -
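
The tasks listed for this part could be sketched as follows. This is an illustration under assumptions, not the journal's listing: the tiny sales table, its column names, the median fill, and the 1.5 × IQR outlier rule are all choices made here for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical sales records: one missing 'units', one missing 'price', one extreme 'units'.
sales = pd.DataFrame({
    "region": ["North", "South", "North", "East", "South", "East"],
    "units": [10, 12, np.nan, 11, 9, 200],
    "price": [5.0, 5.5, 5.0, np.nan, 6.0, 5.5],
})

# 1. Missing values: fill numeric gaps with the column median.
sales["units"] = sales["units"].fillna(sales["units"].median())
sales["price"] = sales["price"].fillna(sales["price"].median())

# 2. Outliers: keep rows whose 'units' lie within 1.5 * IQR of the quartiles.
q1, q3 = sales["units"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = sales["units"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
clean = sales[in_range].copy()

# 3. Filtering, sorting, and grouping.
north_or_south = clean[clean["region"].isin(["North", "South"])]
sorted_sales = clean.sort_values("units", ascending=False)
clean["revenue"] = clean["units"] * clean["price"]
revenue_by_region = clean.groupby("region")["revenue"].sum()

print(revenue_by_region)
```

Dropping rows with `dropna()` or capping extreme values are equally common alternatives; which one fits depends on the data.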

Practical No. 03

Aim: - Feature Scaling and Dummification

• Apply feature-scaling techniques like standardization and normalization to numerical features.

To apply feature-scaling techniques like standardization and normalization to numerical
features, we need a dataset. Hence, we are going to use the ‘wine.csv’ dataset.

Wine.csv: -

Python Code to perform Feature-Scaling: -
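
As a minimal sketch of the two techniques (not the journal's listing), the formulas can be applied directly with pandas. The three-row table and its column names are hypothetical stand-ins for ‘wine.csv’.

```python
import pandas as pd

# Hypothetical numeric features standing in for columns of 'wine.csv'.
wine = pd.DataFrame({
    "alcohol": [12.0, 13.0, 14.0],
    "malic_acid": [1.0, 2.0, 4.0],
})

# Standardization (z-score): subtract the mean and divide by the standard deviation.
# ddof=0 uses the population standard deviation.
standardized = (wine - wine.mean()) / wine.std(ddof=0)

# Normalization (min-max): rescale each column to the range [0, 1].
normalized = (wine - wine.min()) / (wine.max() - wine.min())

print(standardized)
print(normalized)
```

After standardization each column has mean 0 and standard deviation 1; after normalization each column runs from 0 to 1.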

• Perform feature dummification to convert categorical variables into numerical representations.

To perform feature dummification on categorical variables, we need a dataset. Hence, we
are going to use the ‘iris.csv’ dataset.

Iris.csv: -

Python Code to perform Dummification: -
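
Dummification can be sketched with `pandas.get_dummies`. This is an illustration rather than the journal's listing, and the four rows and the column names `sepal_length` and `species` are assumptions about the layout of ‘iris.csv’.

```python
import pandas as pd

# Hypothetical rows standing in for 'iris.csv'.
iris = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3, 4.9],
    "species": ["setosa", "versicolor", "virginica", "setosa"],
})

# Replace the categorical column with one 0/1 indicator column per category.
dummies = pd.get_dummies(iris, columns=["species"], dtype=int)

print(dummies)
```

Passing `drop_first=True` drops one indicator per variable, which avoids redundant columns when the result feeds a linear model.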

Practical No. 04

Aim: - Hypothesis Testing

• Formulate null and alternative hypotheses for a given problem.
• Conduct a hypothesis test using appropriate statistical tests (e.g., t-test, chi-square test).
• Interpret the results and draw conclusions based on the test outcomes.

1. T-test.

Description: -

The aim of the program is to demonstrate the process of conducting a two-sample t-test and
drawing conclusions based on the results. Specifically, it aims to compare the means of two
samples (sample1 and sample2) drawn from normal distributions with different means but the
same standard deviation.

Detailed Breakdown: -

a) Generate Samples: The program generates two samples, each representing a different
population or group. These samples are generated from normal distributions with
means of 10 and 12, and a standard deviation of 2.
b) Perform Hypothesis Test: The program conducts a two-sample t-test to determine
whether there is a statistically significant difference between the means of the two
samples.
c) Set Significance Level: It sets a significance level (alpha) at 0.05, which is a common
threshold used in hypothesis testing.
d) Visualize Distributions: The program plots histograms of the two samples to visualize
their distributions and compare their means visually.
e) Highlight Critical Region: If the p-value from the t-test is less than the significance
level, the program highlights the critical region on the plot to indicate where the
observed difference in means is statistically significant.
f) Draw Conclusions: Based on the results of the t-test, the program draws conclusions
about whether there is significant evidence to reject the null hypothesis (i.e., the
means of the two populations are equal) and provides interpretations of the findings
based on the direction of the difference in means, if applicable.

Python Code to perform T-test: -
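
The breakdown above can be sketched with NumPy and SciPy. This is a minimal illustration, not the journal's listing: the sample size of 100 and the random seed are assumptions, and the histogram and critical-region plotting steps are left out to keep the example short.

```python
import numpy as np
from scipy import stats

# Generate two samples from normal distributions (means 10 and 12, std 2).
# The sample size and the seed are assumptions made for this sketch.
rng = np.random.default_rng(42)
sample1 = rng.normal(loc=10, scale=2, size=100)
sample2 = rng.normal(loc=12, scale=2, size=100)

# H0: the two population means are equal. H1: they differ.
t_stat, p_value = stats.ttest_ind(sample1, sample2)

alpha = 0.05  # significance level
if p_value < alpha:
    higher = "Sample 1" if sample1.mean() > sample2.mean() else "Sample 2"
    conclusion = f"Reject H0: the mean of {higher} is significantly higher."
else:
    conclusion = "Fail to reject H0: no significant difference between the means."

print(f"t = {t_stat:.3f}, p = {p_value:.4g}")
print(conclusion)
```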

Conclusion: -

Based on the results of the two-sample t-test:

1. If p-value < alpha (0.05):
   • If the mean of Sample 1 is significantly higher than that of Sample 2:
     ▪ Conclusion: There is significant evidence to reject the null hypothesis.
     ▪ Interpretation: The mean of Sample 1 is significantly higher than that
       of Sample 2.
   • If the mean of Sample 2 is significantly higher than that of Sample 1:
     ▪ Conclusion: There is significant evidence to reject the null hypothesis.
     ▪ Interpretation: The mean of Sample 2 is significantly higher than that
       of Sample 1.
2. If p-value ≥ alpha (0.05):
   • Conclusion: Fail to reject the null hypothesis.
   • Interpretation: There is not enough evidence to claim a significant difference
     between the means of the two samples.

These conclusions are drawn based on the comparison of p-value with the chosen
significance level (alpha) and provide insights into whether there is significant evidence to
support the alternative hypothesis, which posits a difference between the means of the two
populations.

2. Chi-square Test.

Description: -

We apply the chi-square test to check whether there is an association between two given
categorical variables.

Assumptions: -

• Observations in each sample are independent and identically distributed.
• Expected frequencies should be at least 5 for the majority (80%) of the cells.
• Both variables are categorical.

We are going to perform Chi-square Test on the following ‘mpg.csv’ dataset.

Python Program to perform Chi-square Test: -
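
A chi-square test of independence can be sketched with `pandas.crosstab` and `scipy.stats.chi2_contingency`. This is an illustration, not the journal's listing: the 80 rows and the category labels are made up, and only the column names `horsepower_new` and `modelyear_new` follow the names used in this practical.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical binned data standing in for the categorised 'mpg.csv' columns.
cars = pd.DataFrame({
    "horsepower_new": ["low"] * 40 + ["high"] * 40,
    "modelyear_new": ["early"] * 30 + ["late"] * 10 + ["early"] * 12 + ["late"] * 28,
})

# Contingency table of observed counts for the two categorical variables.
observed = pd.crosstab(cars["horsepower_new"], cars["modelyear_new"])

# H0: the two variables are independent.
chi2, p_value, dof, expected = chi2_contingency(observed)

alpha = 0.05
print(observed)
print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p_value:.4g}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```

The `expected` array makes it easy to check the assumption that most cells have an expected frequency of at least 5.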

Conclusion: -

There is sufficient evidence to reject the null hypothesis, indicating that there is a significant
association between 'horsepower_new' and 'modelyear_new' categories.

Practical No. 05

Aim: - ANOVA (Analysis of Variance)

• Perform one-way ANOVA to compare means across multiple groups.
• Conduct post-hoc tests to identify significant differences between group means.

Python Program to perform ANOVA (Analysis of Variance): -
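
A minimal sketch of both steps, not the journal's listing. The three synthetic groups are assumptions, and the post-hoc step here uses pairwise t-tests with a Bonferroni correction as a simple stand-in; Tukey's HSD is another common choice.

```python
from itertools import combinations

import numpy as np
from scipy import stats

# Three hypothetical groups with different true means (synthetic data).
rng = np.random.default_rng(0)
groups = {
    "A": rng.normal(50, 5, 30),
    "B": rng.normal(55, 5, 30),
    "C": rng.normal(65, 5, 30),
}

# One-way ANOVA. H0: all group means are equal.
f_stat, p_value = stats.f_oneway(*groups.values())
print(f"F = {f_stat:.2f}, p = {p_value:.4g}")

# Post-hoc: pairwise t-tests, each p-value multiplied by the number of comparisons.
pairs = list(combinations(groups, 2))
adjusted = {}
for a, b in pairs:
    _, p = stats.ttest_ind(groups[a], groups[b])
    adjusted[(a, b)] = min(p * len(pairs), 1.0)

for pair, p_adj in adjusted.items():
    print(pair, f"adjusted p = {p_adj:.4g}")
```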

Practical No. 06

Aim: - Regression and Its Types

• Implement simple linear regression using a dataset.
• Explore and interpret the regression model coefficients and goodness-of-fit measures.
• Extend the analysis to multiple linear regression and assess the impact of additional predictors.

Python Program to perform Regression: -
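
The three objectives can be sketched with ordinary least squares in NumPy. This is an illustration on synthetic data, not the journal's listing; the helper `fit_ols`, the predictors `x1` and `x2`, and the true coefficients are all assumptions made for the example.

```python
import numpy as np

# Synthetic data: y depends on two predictors plus noise.
rng = np.random.default_rng(1)
n = 50
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 5, n)
y = 3.0 + 2.0 * x1 + 1.5 * x2 + rng.normal(0, 1.0, n)


def fit_ols(features, target):
    """Fit least squares with an intercept; return (coefficients, R^2)."""
    X = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    ss_res = np.sum((target - X @ coef) ** 2)       # unexplained variation
    ss_tot = np.sum((target - target.mean()) ** 2)  # total variation
    return coef, 1.0 - ss_res / ss_tot


# Simple linear regression: y on x1 only.
coef_simple, r2_simple = fit_ols(x1, y)

# Multiple linear regression: y on x1 and x2.
coef_multi, r2_multi = fit_ols(np.column_stack([x1, x2]), y)

print("simple:  ", coef_simple, f"R^2 = {r2_simple:.3f}")
print("multiple:", coef_multi, f"R^2 = {r2_multi:.3f}")
```

Each coefficient is the expected change in `y` for a one-unit change in that predictor with the others held fixed, and the rise in R² from the simple to the multiple model shows how much the additional predictor explains.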

Practical No. 07

Aim: - Logistic Regression and Decision Tree

• Build a logistic regression model to predict a binary outcome.
• Evaluate the model's performance using classification metrics (e.g., accuracy, precision, recall).
• Construct a decision tree model and interpret the decision rules for classification.

Python Program: -
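
A minimal sketch of the three objectives, assuming scikit-learn is available. It is not the journal's listing; the synthetic two-feature dataset and the feature names `x1` and `x2` are made up for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic binary problem: the class is 1 when x1 + x2 > 0.
rng = np.random.default_rng(7)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Logistic regression and its classification metrics on the held-out data.
log_reg = LogisticRegression().fit(X_train, y_train)
y_pred = log_reg.predict(X_test)
metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
}
print(metrics)

# A shallow decision tree; export_text prints its decision rules as text.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
rules = export_text(tree, feature_names=["x1", "x2"])
print(rules)
```

Each printed rule path is a chain of threshold tests on the features that ends in a predicted class, which is what makes the tree directly interpretable.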

Practical No. 08

Aim: - K-Means Clustering

• Apply the K-Means algorithm to group similar data points into clusters.
• Determine the optimal number of clusters using elbow method or silhouette analysis.
• Visualize the clustering results and analyze the cluster characteristics.

We are going to perform K-Means Clustering on the following ‘Wholesale.csv’ dataset.

Python Program to perform K-Means Clustering: -
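
A minimal sketch of K-Means with both model-selection aids, assuming scikit-learn is available. It is not the journal's listing: synthetic 2-D points stand in for ‘Wholesale.csv’, and the plotting step is omitted.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in data: three well-separated blobs of 50 points each.
rng = np.random.default_rng(3)
centers = np.array([[0, 0], [8, 8], [0, 8]])
X = np.vstack([c + rng.normal(scale=0.8, size=(50, 2)) for c in centers])

# Try several values of k, recording inertia (elbow method) and silhouette score.
inertias = {}
silhouettes = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_
    silhouettes[k] = silhouette_score(X, model.labels_)

# Pick the k with the highest silhouette score and fit the final model.
best_k = max(silhouettes, key=silhouettes.get)
final = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit(X)

print("inertia by k:", inertias)
print("silhouette by k:", silhouettes)
print("chosen k:", best_k)
print("cluster centers:\n", final.cluster_centers_)
```

For the elbow method, the inertia values are plotted against k and the point where the curve stops dropping sharply is taken as the number of clusters. On real data with features on different scales, scaling the features before clustering usually matters.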

Practical No. 09

Aim: - Principal Component Analysis (PCA)

• Perform PCA on a dataset to reduce dimensionality.
• Evaluate the explained variance and select the appropriate number of principal components.
• Visualize the data in the reduced-dimensional space.

Python Program for PCA: -
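
PCA can be sketched with a singular value decomposition in NumPy. This is an illustration, not the journal's listing: the synthetic five-feature dataset and the 95% explained-variance threshold are assumptions, and the scatter plot of the reduced data is omitted.

```python
import numpy as np

# Synthetic data: 5 observed features driven by 2 latent factors plus small noise.
rng = np.random.default_rng(5)
latent = rng.normal(size=(100, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + rng.normal(scale=0.05, size=(100, 5))

# Center the data, then take the SVD; the rows of Vt are the principal axes.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Share of total variance explained by each component, and its running total.
explained_variance_ratio = S**2 / np.sum(S**2)
cumulative = np.cumsum(explained_variance_ratio)

# Keep the smallest number of components that explains at least 95% of the variance.
n_components = int(np.searchsorted(cumulative, 0.95) + 1)

# Project the centered data onto the retained principal axes.
X_reduced = X_centered @ Vt[:n_components].T

print("explained variance ratio:", np.round(explained_variance_ratio, 4))
print("components kept:", n_components)
print("reduced shape:", X_reduced.shape)
```

This sketch only centers the data. When features are measured in different units, standardizing them first is the usual practice.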
