
Data Science and its applications Lab (21AD62)

DATA SCIENCE AND ITS APPLICATIONS

Subject code: 21AD62 IA Marks: 20


Hour/week: 02 Total Hours: 20

Course Outcomes:
After the completion of the course, the students will be able to:
CO1 – Demonstrate proficiency with statistical analysis of data to derive insights from results and
interpret the data findings visually.
CO2 – Demonstrate skills in data management by obtaining, cleaning and transforming the data.
CO3 – Apply machine learning models to solve business-related challenges.
CO4 – Implement decision trees, neural network layers and data partitioning.
CO5 – Estimate how social clustering shapes individuals and groups in contemporary society.

Sl. No.  List of Experiments (with mapped CO)

1. Installation of Python/R language and the Visual Studio Code editor, demonstrated along with Kaggle dataset usage. (CO1)
2. Write programs in Python/R and execute them in either Visual Studio Code or PyCharm Community Edition or any other suitable environment. (CO1)
3. A study was conducted to understand the effect of the number of hours students spent studying on their performance in the final exams. Write code to plot a line chart with the number of hours spent studying on the x-axis and the score in the final exam on the y-axis. Use a red ‘*’ as the point character, label the axes and give the plot a title. (CO1)
4. For the given dataset mtcars.csv (www.kaggle.com/ruiromanini/mtcars), plot a histogram to check the frequency distribution of the variable ‘mpg’ (miles per gallon). (CO2)
5. Consider the books dataset BL-Flickr-Images-Book.csv from Kaggle (https://www.kaggle.com/adeyoyintemidayo/publication-of-books), which contains information about books. Write a program to demonstrate the following. (CO2)
    Import the data into a DataFrame
    Find and drop the columns which are irrelevant for the book information
    Change the index of the DataFrame
    Tidy up fields in the data, such as date of publication, with the help of a simple regular expression
    Combine str methods with NumPy to clean columns
6. Train a regularized logistic regression classifier on the iris dataset (https://archive.ics.uci.edu/ml/machine-learning-databases/iris/ or the inbuilt iris dataset) using sklearn. Train the model with the hyperparameter C = 1e4 and report the best classification accuracy. (CO3)
7. Train an SVM classifier on the iris dataset using sklearn. Try different kernels and the associated hyperparameters. Train the model with the following set of hyperparameters: RBF kernel, gamma = 0.5, one-vs-rest classifier, no feature normalization. Also try C = 0.01, 1, 10. For the above set of hyperparameters, find the best classification accuracy along with the total number of support vectors on the test data. (CO3)
8. Consider the given dataset. Write a program to demonstrate the working of the decision tree based ID3 algorithm. (CO4)
9. Consider the dataset spiral.txt (https://bit.ly/2Lm75Ly). The first two columns in the dataset correspond to the co-ordinates of each data point. The third column corresponds to the actual cluster label. Compute the Rand index for the following methods: (CO4)
    K-means clustering
    Single-link hierarchical clustering
    Complete-link hierarchical clustering
   Also visualize the dataset and determine which algorithm is able to recover the true clusters.
10. Mini Project – Simple web scraping in social media.

CONTENT BEYOND SYLLABUS

Sl. No.  List of Experiments (with mapped CO)

1. Write a program to demonstrate regression analysis with residual plots on a given data set. (CO1)
2. Write a program to visualize the probability distribution of a sample against a single continuous attribute. (CO3)


1) Installation of Python/R language and the Visual Studio Code editor, demonstrated
along with Kaggle dataset usage

Steps:
Installation of Python:
Download Python

 Go to the official Python website: Python.org.


 Download the latest version of Python for your operating system (Windows, macOS, or Linux).
Install Python:

 Run the downloaded installer.


 Check the box that says "Add Python to PATH" during installation.
 Follow the installation wizard.
Verify Installation:

 Open a terminal or command prompt.


 Type python --version and press Enter to verify that Python is installed.
Installation of R Language:

Download R:

 Go to the official R Project website: R-Project.org.


 Download the appropriate installer for your operating system.
Install R:

 Run the downloaded installer.


 Follow the installation instructions provided by the installer.

Verify Installation:

 Open a terminal or command prompt.


 Type R --version and press Enter to verify that R is installed.

Installation of Visual Studio Code:

Download Visual Studio Code:

 Go to the official Visual Studio Code website: code.visualstudio.com.


 Download the installer for your operating system.
Install Visual Studio Code:

 Run the downloaded installer.


 Follow the installation instructions provided by the installer.


Install Python and R Extensions:

 Open Visual Studio Code after installation.


 Go to the Extensions view by clicking on the square icon on the sidebar or pressing
Ctrl+Shift+X.
 Search for "Python" in the Extensions Marketplace and install the Python extension provided by
Microsoft.
 Search for "R" in the Extensions Marketplace and install the R extension provided by Yuki
Ueda.

Using Kaggle Datasets:

Sign up/Login to Kaggle:

 Go to the Kaggle website: Kaggle.com.


 Sign up for a Kaggle account, or log in to an existing one.
Explore Datasets:

 Once logged in, navigate to the "Datasets" section from the top menu.
 Explore the available datasets by browsing or searching for specific topics.
Download Datasets:

 Click on a particular dataset.


 On the dataset page, find a "Download" button.
 Download the dataset in a format suitable for analysis (usually CSV or other common formats).
Import Datasets in Python or R:

In a Python or R script within Visual Studio Code, use libraries such as pandas (for Python) or readr (for
R) to import the downloaded datasets and start analyzing them.

Import and Analyse a Kaggle Dataset in Python:

In a Python script in Visual Studio Code, import pandas and read the downloaded CSV file:

Code:

import pandas as pd
print(pd.__version__)

# Read the dataset
df = pd.read_csv("desktop/water-quality-1.csv")

# Display the first few rows of the dataset
print(df.head())


 Save your scripts.


 Run each script separately by clicking on the "Run" button in Visual Studio Code or by using the
appropriate keyboard shortcut (usually Ctrl+Enter).

Output:
Sample ID Grab ID Profile ID Sample Number Collect DateTime \
0 16316 16316.0 10702 9209019 04/13/1992 12:00:00 AM
1 8937 8937.0 37688 7915489 06/20/1979 12:00:00 AM
2 137745 137745.0 54368 L58228-1 06/25/2013 08:09:00 AM
3 131816 131816.0 50605 L55068-6 02/13/2012 09:38:00 AM
4 82325 82325.0 43896 L52933-87 03/30/2011 02:36:00 PM

Depth (m) Site Type Area Locator \


0 1.0 Streams and Rivers Pipers KSHZ06
1 1.0 Streams and Rivers Crisp 0321
2 1.0 Large Lakes Lake Union/Ship Canal 0512
3 1.0 Large Lakes Lake Union/Ship Canal 0540
4 4.2 Large Lakes Lake Washington 0804

Site ... MDL RDL \


0 Pipers Creek mouth ... NaN NaN
1 Crisp Creek mouth at SE Green Valley Rd ... NaN NaN
2 Ship Canal above locks ... NaN NaN
3 Ship Canal near Montlake Bridge ... 0.002 0.005
4 Lake Washington north end ... NaN NaN

Text Value Sample Info Steward Note \


0 .070||King County Nstream Database/B53311 NaN NaN
1 .727||King County Nstream Database/RS2 NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN

Replicates Replicate Of Method Date Analyzed Data Source


0 NaN NaN none NaN KCEL
1 NaN NaN NaN NaN KCEL
2 NaN NaN HYDROLAB 06/25/2013 KCEL
3 NaN NaN SM4500-P-F 02/15/2012 KCEL
4 NaN NaN HYDROLAB NaN KCEL

[5 rows x 25 columns]


2) Write programs in Python/R and execute them in either Visual Studio Code or
PyCharm Community Edition or any other suitable environment.

Example: Write Python code to find factorial of a number

Code:
def factorial(n):
    if n == 0:
        return 1
    else:
        return n * factorial(n-1)

num = int(input("Enter a number: "))
print("Factorial of", num, "is", factorial(num))

Output:
Enter a number: 5
Factorial of 5 is 120

Enter a number: 500


Factorial of 500 is
1220136825991110068701238785423046926253574342803192842192413588385845373153881
9976054964475022032818630136164771482035841633787220781772004807852051593292854
7790757193933060377296085908627042917454788242491272634430567017327076946106280
2310452644218878789465754777149863494367781037644274033827365397471386477878495
4384895955375379904232410612713269843277457155463099772027810145610811883737095
3101635632443298702956389662891165897476957208792692887128178007026517450776841
0719624390394322536422605234945850129918571501248706961568141625359056693423813
0088562492468915641267756544818865065938479517753608940057452389403357984763639
4490531306232374906644504882466507594673586207463792518420045936969298102226397
1952597190945217823331756934581508552332820762820023402626907898342451712006207
7146409794561161276291459512372299133401695523638509428855920187274337951730145
8635757082835578015873543276888868012039988238470215146760544540766353598417443
0480128938313896881639487469658817504506926365338175055478128640000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000
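Note that the recursive version relies on Python's call stack, which is capped by the interpreter's recursion limit (roughly 1000 frames by default), so inputs much larger than 500 would raise a RecursionError. A small iterative sketch that avoids this limit (the function name factorial_iterative is just illustrative):

import math

def factorial_iterative(n):
    # Multiply 1..n in a loop instead of recursing, so no recursion-depth limit applies
    result = 1
    for i in range(2, n + 1):
        result *= i
    return result

num = int(input("Enter a number: "))
print("Factorial of", num, "is", factorial_iterative(num))
# The standard library's math.factorial(num) returns the same value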


3) A study was conducted to understand the effect of the number of hours students
spent studying on their performance in the final exams. Write code to plot a line
chart with the number of hours spent studying on the x-axis and the score in the final
exam on the y-axis. Use a red ‘*’ as the point character, label the axes and give the plot a title.

Number of hrs spent studying (x):     10  9   2   15  10  16  11  16
Score in the final exam (0–100) (y):  95  80  10  50  45  98  38  93

Code:

import matplotlib.pyplot as plt


# Data
hours_studied = [10, 9, 2, 15, 10, 16, 11, 16]
exam_scores = [95, 80, 10, 50, 45, 98, 38, 93]
# Plotting
plt.plot(hours_studied, exam_scores, marker='*', color='red', linestyle='-')
plt.xlabel('Number of hrs spent studying')
plt.ylabel('Score in the final exam (0 - 100)')
plt.title('Relationship between Hours Studied and Exam Scores')
plt.grid(True)
# Show plot
plt.show()

Output:


4) For the given dataset mtcars.csv (www.kaggle.com/ruiromanini/mtcars), plot a


histogram to check the frequency distribution of the variable ‘mpg’ (Miles per
gallon)

Code:

import pandas as pd
import matplotlib.pyplot as plt
# Load the dataset
mtcars = pd.read_csv('desktop/mtcars.csv')
# Plot histogram
plt.hist(mtcars['mpg'], bins=10, color='skyblue', edgecolor='black')
# Add labels and title
plt.xlabel('Miles per gallon')
plt.ylabel('Frequency')
plt.title('Frequency Distribution of Miles per Gallon')
# Show plot
plt.show()
Output:
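Beyond the plot itself, a quick numeric summary can cross-check what the histogram shows; a small optional addition, assuming the mtcars DataFrame loaded above:

# Summary statistics of 'mpg' to complement the histogram
print(mtcars['mpg'].describe())   # count, mean, std, min, quartiles, max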


5) Consider the books dataset BL-Flickr-Images-Book.csv from Kaggle
(https://www.kaggle.com/adeyoyintemidayo/publication-of-books), which contains
information about books. Write a program to demonstrate the following.

Steps:

 Import the data into a DataFrame


 Find and drop the columns which are irrelevant for the book information.
 Change the Index of the DataFrame
 Tidy up fields in the data such as date of publication with the help of simple regular expression.
 Combine str methods with NumPy to clean columns

Code:
import pandas as pd
import numpy as np
import re

# Import the data into a DataFrame


books_df = pd.read_csv('desktop/BL-Flickr-Images-Book.csv')

# Display the first few rows of the DataFrame


print("Original DataFrame:")
print(books_df.head())

# Find and drop the columns which are irrelevant for the book information
columns_to_drop = ['Edition Statement', 'Corporate Author', 'Corporate Contributors',
                   'Former owner', 'Engraver', 'Contributors', 'Issuance type', 'Shelfmarks']
books_df.drop(columns=columns_to_drop, inplace=True)

# Change the Index of the DataFrame


books_df.set_index('Identifier', inplace=True)

# Tidy up fields in the data, such as date of publication, with the help of a simple regular expression
def clean_date(date):
    if isinstance(date, str):
        match = re.search(r'\d{4}', date)
        if match:
            return match.group()
    return np.nan

books_df['Date of Publication'] = books_df['Date of Publication'].apply(clean_date)

# Combine str methods with NumPy to clean columns


books_df['Place of Publication'] = np.where(
    books_df['Place of Publication'].str.contains('London'),
    'London',
    np.where(
        books_df['Place of Publication'].str.contains('Oxford'),
        'Oxford',
        books_df['Place of Publication'].replace(r'^\s*$', 'Unknown', regex=True)
    )
)
# Display the cleaned DataFrame
print("\nCleaned DataFrame:")
print(books_df.head())

Output:
Original DataFrame:
Identifier Edition Statement Place of Publication \
0 206 NaN London
1 216 NaN London; Virtue & Yorston
2 218 NaN London
3 472 NaN London
4 480 A new edition, revised, etc. London

Date of Publication Publisher \


0 1879 [1878] S. Tinsley & Co.
1 1868 Virtue & Co.
2 1869 Bradbury, Evans & Co.
3 1851 James Darling
4 1857 Wertheim & Macintosh

Title Author \
0 Walter Forbes. [A novel.] By A. A A. A.
1 All for Greed. [A novel. The dedication signed... A., A. A.
2 Love the Avenger. By the author of “All for Gr... A., A. A.
3 Welsh Sketches, chiefly ecclesiastical, to the... A., E. S.
4 [The World in which I live, and my place in it... A., E. S.

Contributors Corporate Author \


0 FORBES, Walter. NaN
1 BLAZE DE BURY, Marie Pauline Rose - Baroness NaN
2 BLAZE DE BURY, Marie Pauline Rose - Baroness NaN
3 Appleyard, Ernest Silvanus. NaN
4 BROOME, John Henry. NaN


Corporate Contributors Former owner Engraver Issuance type \


0 NaN NaN NaN monographic
1 NaN NaN NaN monographic
2 NaN NaN NaN monographic
3 NaN NaN NaN monographic
4 NaN NaN NaN monographic

Flickr URL \
0 http://www.flickr.com/photos/britishlibrary/ta...
1 http://www.flickr.com/photos/britishlibrary/ta...
2 http://www.flickr.com/photos/britishlibrary/ta...
3 http://www.flickr.com/photos/britishlibrary/ta...
4 http://www.flickr.com/photos/britishlibrary/ta...

Shelfmarks
0 British Library HMNTS 12641.b.30.
1 British Library HMNTS 12626.cc.2.
2 British Library HMNTS 12625.dd.1.
3 British Library HMNTS 10369.bbb.15.
4 British Library HMNTS 9007.d.28.

Cleaned DataFrame:
Place of Publication Date of Publication Publisher \
Identifier
206 London 1879 S. Tinsley & Co.
216 London 1868 Virtue & Co.
218 London 1869 Bradbury, Evans & Co.
472 London 1851 James Darling
480 London 1857 Wertheim & Macintosh

Title Author Identifier


206 Walter Forbes. [A novel.] By A. A A. A.
216 All for Greed. [A novel. The dedication signed... A., A. A.
218 Love the Avenger. By the author of “All for Gr... A., A. A.
472 Welsh Sketches, chiefly ecclesiastical, to the... A., E. S.
480 [The World in which I live, and my place in it... A., E. S.

Flickr URL
Identifier
206 http://www.flickr.com/photos/britishlibrary/ta...
216 http://www.flickr.com/photos/britishlibrary/ta...
218 http://www.flickr.com/photos/britishlibrary/ta...
472 http://www.flickr.com/photos/britishlibrary/ta...
480 http://www.flickr.com/photos/britishlibrary/ta...
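As an optional sanity check on the cleaning steps (assuming the books_df produced by the code above), the cleaned columns can be inspected directly:

# Rows where no four-digit year could be extracted
print(books_df['Date of Publication'].isnull().sum())

# Convert the extracted years to numbers and summarise them
print(pd.to_numeric(books_df['Date of Publication']).describe())

# Most frequent places of publication after cleaning
print(books_df['Place of Publication'].value_counts().head())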


6) Train a regularized logistic regression classifier on the iris dataset
(https://archive.ics.uci.edu/ml/machine-learning-databases/iris/ or the inbuilt iris
dataset) using sklearn. Train the model with the following hyperparameter C = 1e4
and report the best classification accuracy.

Code:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Load the Iris dataset


iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a pipeline with standardization and logistic regression


model = make_pipeline(StandardScaler(), LogisticRegression(C=1e4))

# Train the model


model.fit(X_train, y_train)

# Report the training accuracy


training_accuracy = model.score(X_train, y_train)
print(f"Training Accuracy: {training_accuracy}")

# Report the testing accuracy


testing_accuracy = model.score(X_test, y_test)
print(f"Testing Accuracy: {testing_accuracy}")

Output:
Training Accuracy: 0.9833333333333333
Testing Accuracy: 1.0
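Because the accuracies above depend on one particular train/test split, a cross-validated estimate gives a more stable figure for the "best" accuracy. A brief optional sketch, using the same pipeline as above:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(C=1e4, max_iter=1000))

scores = cross_val_score(model, X, y, cv=5)   # 5-fold cross-validation
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())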


7) Train an SVM classifier on the iris dataset using sklearn. Try different kernels
and the associated hyperparameters. Train the model with the following set of
hyperparameters: RBF kernel, gamma = 0.5, one-vs-rest classifier, no feature
normalization. Also try C = 0.01, 1, 10. For the above set of hyperparameters, find
the best classification accuracy along with the total number of support vectors on
the test data.

Code:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Load the Iris dataset


iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the hyperparameters


kernels = ['rbf']
gammas = [0.5]
Cs = [0.01, 1, 10]

# Initialize variables to store best accuracy and support vectors


best_accuracy = 0
best_support_vectors = None

# Iterate over different hyperparameters
for kernel in kernels:
    for gamma in gammas:
        for C in Cs:
            # Create the SVM classifier with the specified hyperparameters
            model = SVC(kernel=kernel, gamma=gamma, C=C, decision_function_shape='ovr')

            # Train the model
            model.fit(X_train, y_train)

            # Calculate accuracy on the test data
            accuracy = model.score(X_test, y_test)

            # Get the total number of support vectors
            total_support_vectors = np.sum(model.n_support_)

            # Update best accuracy and support vectors if current accuracy is better
            if accuracy > best_accuracy:
                best_accuracy = accuracy
                best_support_vectors = total_support_vectors

            # Print the current hyperparameters and corresponding accuracy
            print(f"Kernel: {kernel}, Gamma: {gamma}, C: {C}, Accuracy: {accuracy}, "
                  f"Total Support Vectors: {total_support_vectors}")

# Print the best accuracy and total number of support vectors
print(f"\nBest Accuracy: {best_accuracy}")
print(f"Total Support Vectors for Best Model: {best_support_vectors}")

Output:
Kernel: rbf, Gamma: 0.5, C: 0.01, Accuracy: 0.3, Total Support Vectors: 120
Kernel: rbf, Gamma: 0.5, C: 1, Accuracy: 1.0, Total Support Vectors: 39
Kernel: rbf, Gamma: 0.5, C: 10, Accuracy: 1.0, Total Support Vectors: 31

Best Accuracy: 1.0


Total Support Vectors for Best Model: 39
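The manual loop above can also be expressed with scikit-learn's GridSearchCV, which cross-validates each hyperparameter combination before the test accuracy is measured. A hedged alternative sketch with the same kernel, gamma and C values:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

param_grid = {'kernel': ['rbf'], 'gamma': [0.5], 'C': [0.01, 1, 10]}
grid = GridSearchCV(SVC(decision_function_shape='ovr'), param_grid, cv=5)
grid.fit(X_train, y_train)

print("Best hyperparameters:", grid.best_params_)
print("Test accuracy of best model:", grid.best_estimator_.score(X_test, y_test))
print("Support vectors in best model:", grid.best_estimator_.n_support_.sum())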


8) Consider the following dataset. Write a program to demonstrate the working of


the decision tree based ID3 algorithm.

Code:
import numpy as np

class Node:
    def __init__(self, feature=None, value=None, results=None, true_branch=None, false_branch=None):
        self.feature = feature            # Feature to split on
        self.value = value                # Value of the feature
        self.results = results            # None for internal nodes, holds class counts for leaf nodes
        self.true_branch = true_branch    # Subtree for when the condition is true
        self.false_branch = false_branch  # Subtree for when the condition is false

def unique_vals(rows, col):
    return set([row[col] for row in rows])

def class_counts(rows):
    # A dictionary of label -> count.
    counts = {}
    for row in rows:
        # In our dataset format, the label is always the last column
        label = row[-1]
        if label not in counts:
            counts[label] = 0
        counts[label] += 1
    return counts

def is_numeric(value):
    return isinstance(value, int) or isinstance(value, float)

def gini(rows):
    counts = class_counts(rows)
    impurity = 1
    for lbl in counts:
        prob_of_lbl = counts[lbl] / float(len(rows))
        impurity -= prob_of_lbl**2
    return impurity

def info_gain(left, right, current_uncertainty):
    p = float(len(left)) / (len(left) + len(right))
    return current_uncertainty - p * gini(left) - (1 - p) * gini(right)

def find_best_split(rows):
    best_gain = 0
    best_feature = None
    current_uncertainty = gini(rows)
    n_features = len(rows[0]) - 1
    for col in range(n_features):                 # For each feature
        values = set([row[col] for row in rows])  # Unique values in the column
        for val in values:                        # For each value
            true_rows, false_rows = split(rows, col, val)
            if len(true_rows) == 0 or len(false_rows) == 0:
                continue
            gain = info_gain(true_rows, false_rows, current_uncertainty)
            if gain >= best_gain:
                best_gain, best_feature = gain, (col, val)
    return best_gain, best_feature

def split(rows, col, value):
    true_rows, false_rows = [], []
    for row in rows:
        if row[col] == value:
            true_rows.append(row)
        else:
            false_rows.append(row)
    return true_rows, false_rows

def build_tree(rows):
    gain, (feature, value) = find_best_split(rows)
    if gain == 0:
        return Node(results=class_counts(rows))
    true_rows, false_rows = split(rows, feature, value)
    true_branch = build_tree(true_rows)
    false_branch = build_tree(false_rows)
    return Node(feature, value, true_branch=true_branch, false_branch=false_branch)

def print_tree(node, spacing=""):
    if node.results is not None:  # Leaf node
        print(spacing + "Predict", node.results)
        return
    print(spacing + str(node.feature) + ':' + str(node.value))
    print(spacing + '--> True:')
    print_tree(node.true_branch, spacing + " ")
    print(spacing + '--> False:')
    print_tree(node.false_branch, spacing + " ")

def classify(row, node):
    if node.results is not None:
        return node.results
    if is_numeric(row[node.feature]):
        if row[node.feature] >= node.value:
            return classify(row, node.true_branch)
        else:
            return classify(row, node.false_branch)
    else:
        if row[node.feature] == node.value:
            return classify(row, node.true_branch)
        else:
            return classify(row, node.false_branch)

# Define the dataset
dataset = [
    ['Low', 'Low', 2, 'No', 'Yes'],
    ['Low', 'Med', 4, 'Yes', 'Yes'],
    ['Low', 'Low', 4, 'No', 'Yes'],
    ['Low', 'Med', 4, 'No', 'No'],
    ['Low', 'High', 4, 'No', 'No'],
    ['Med', 'Med', 4, 'No', 'No'],
    ['Med', 'Med', 4, 'Yes', 'Yes'],
    ['Med', 'High', 2, 'Yes', 'No'],
    ['Med', 'High', 5, 'No', 'Yes'],
    ['High', 'Med', 4, 'Yes', 'Yes'],
    ['High', 'Med', 2, 'Yes', 'Yes'],
    ['High', 'High', 2, 'Yes', 'No'],
    ['High', 'High', 5, 'Yes', 'Yes']
]

# Build the tree
tree = build_tree(dataset)

# Print the tree
print("Decision Tree:")
print_tree(tree)

# Test the tree
test_data = ['Med', 'Low', 4, 'No']  # Test data instance
print("\nClassifying Test Data:")
print("Test Data:", test_data)
print("Classification:", classify(test_data, tree))


Output:
Decision Tree:
1:High
--> True:
2:5
--> True:
Predict {'Yes': 2}
--> False:
Predict {'No': 3}
--> False:
3:Yes
--> True:
Predict {'Yes': 4}
--> False:
1:Low
--> True:
Predict {'Yes': 2}
--> False:
Predict {'No': 2}

Classifying Test Data:


Test Data: ['Med', 'Low', 4, 'No']
Classification: {'Yes': 2}
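Note that the splitting criterion in the program above is Gini impurity (as used by CART); classic ID3 selects attributes by entropy-based information gain. A drop-in sketch of that criterion, assuming the class_counts() helper defined above:

import math

def entropy(rows):
    # Shannon entropy of the class labels in rows
    counts = class_counts(rows)
    total = float(len(rows))
    ent = 0.0
    for lbl in counts:
        p = counts[lbl] / total
        ent -= p * math.log2(p)
    return ent

def info_gain_entropy(left, right, current_entropy):
    # Information gain = parent entropy minus the weighted entropy of the children
    p = float(len(left)) / (len(left) + len(right))
    return current_entropy - p * entropy(left) - (1 - p) * entropy(right)

# Replacing gini()/info_gain() with entropy()/info_gain_entropy() inside
# find_best_split() makes the tree split on ID3-style information gain instead.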


9) Consider the dataset spiral.txt (https://bit.ly/2Lm75Ly). The first two columns in
the dataset correspond to the co-ordinates of each data point. The third column
corresponds to the actual cluster label. Compute the Rand index for the following
methods:
 K-means clustering
 Single-link hierarchical clustering
 Complete-link hierarchical clustering
Also visualize the dataset and determine which algorithm is able to recover the true
clusters.
Code:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score
from scipy.cluster.hierarchy import dendrogram
# Load the dataset
data = np.loadtxt('spiral.txt')

# Extract features and true labels


X = data[:, :2]
true_labels = data[:, 2]

# Visualize the dataset


plt.scatter(X[:, 0], X[:, 1], c=true_labels, cmap='viridis')
plt.title("True Clusters")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()
# Compute Rand Index for K-means Clustering
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans_labels = kmeans.fit_predict(X)
rand_index_kmeans = adjusted_rand_score(true_labels, kmeans_labels)
print("Rand Index for K-means Clustering:", rand_index_kmeans)

# Compute Rand Index for Single-link Hierarchical Clustering
single_link = AgglomerativeClustering(n_clusters=3, linkage='single')
single_link_labels = single_link.fit_predict(X)
rand_index_single_link = adjusted_rand_score(true_labels, single_link_labels)
print("Rand Index for Single-link Hierarchical Clustering:", rand_index_single_link)

# Compute Rand Index for Complete-link Hierarchical Clustering


complete_link = AgglomerativeClustering(n_clusters=3, linkage='complete')
complete_link_labels = complete_link.fit_predict(X)
rand_index_complete_link = adjusted_rand_score(true_labels, complete_link_labels)
print("Rand Index for Complete-link Hierarchical Clustering:", rand_index_complete_link)

# Visualize the clusters from different algorithms


plt.figure(figsize=(12, 4))

plt.subplot(1, 3, 1)
plt.scatter(X[:, 0], X[:, 1], c=kmeans_labels, cmap='viridis')
plt.title("K-means Clustering")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")

plt.subplot(1, 3, 2)
plt.scatter(X[:, 0], X[:, 1], c=single_link_labels, cmap='viridis')
plt.title("Single-link Hierarchical Clustering")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")

plt.subplot(1, 3, 3)
plt.scatter(X[:, 0], X[:, 1], c=complete_link_labels, cmap='viridis')
plt.title("Complete-link Hierarchical Clustering")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")

plt.show()


Output:
Rand Index for K-means Clustering: 0.5474568137016044
Rand Index for Single-link Hierarchical Clustering: 0.1923106542083827
Rand Index for Complete-link Hierarchical Clustering: 0.16319049103784852
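Strictly speaking, adjusted_rand_score reports the adjusted Rand index; the plain Rand index asked for in the exercise is available as sklearn.metrics.rand_score in scikit-learn 0.24 and later. A short optional addition, assuming the label arrays computed above:

from sklearn.metrics import rand_score

# Unadjusted Rand index for each clustering (uses the label arrays from the code above)
print("Rand Index for K-means Clustering:", rand_score(true_labels, kmeans_labels))
print("Rand Index for Single-link Hierarchical Clustering:", rand_score(true_labels, single_link_labels))
print("Rand Index for Complete-link Hierarchical Clustering:", rand_score(true_labels, complete_link_labels))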


Content Beyond Syllabus

1. Write a program to demonstrate Regression analysis with residual plots on a given data set.

Code:
import numpy as np
import matplotlib.pyplot as plt

def estimate_coef(x, y):
    # number of observations/points
    n = np.size(x)

    # mean of x and y vector
    mx = np.mean(x)
    my = np.mean(y)

    # calculating cross-deviation and deviation about x
    sxy = np.sum(y*x) - n*my*mx
    sxx = np.sum(x*x) - n*mx*mx

    # calculating regression coefficients
    b1 = sxy / sxx
    b0 = my - b1*mx
    return (b0, b1)

def plot_regression_line(x, y, b):
    # plotting the actual points as scatter plot
    plt.scatter(x, y, color="m", marker="o", s=30)

    # predicted response vector
    ypred = b[0] + b[1]*x

    # plotting the regression line
    plt.plot(x, ypred, color="g")

    # putting labels
    plt.xlabel('x')
    plt.ylabel('y')

    # function to show plot
    plt.show()

def main():
    # observations or data
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])

    # estimating coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb0 = {} \nb1 = {}".format(b[0], b[1]))

    # plotting regression line
    plot_regression_line(x, y, b)

if __name__ == "__main__":
    main()

OUTPUT:
Estimated coefficients:
b0 = 1.2363636363636363
b1 = 1.1696969696969697

RESULT: The computation for Simple Linear Regression was successfully completed
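The experiment title also calls for residual plots, which the listing above stops just short of producing. A minimal sketch, assuming the same x, y data and the coefficients reported above:

import numpy as np
import matplotlib.pyplot as plt

x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])

b0, b1 = 1.2363636363636363, 1.1696969696969697   # coefficients estimated above
residuals = y - (b0 + b1 * x)                      # observed minus fitted values

plt.scatter(x, residuals, color="m")
plt.axhline(y=0, color="g", linestyle="--")        # residuals should scatter randomly around zero
plt.xlabel("x")
plt.ylabel("Residual")
plt.title("Residual Plot")
plt.show()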


2. Write a program to visualize the probability distribution of a sample against a single


continuous attribute.
Code:
# importing the required libraries
from sklearn import datasets
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Setting up the Data Frame
iris = datasets.load_iris()
iris_df = pd.DataFrame(iris.data, columns=['Sepal_Length', 'Sepal_Width',
                                           'Petal_Length', 'Petal_Width'])
iris_df['Target'] = iris.target
iris_df['Target'].replace([0], 'Iris_Setosa', inplace=True)
iris_df['Target'].replace([1], 'Iris_Versicolor', inplace=True)
iris_df['Target'].replace([2], 'Iris_Virginica', inplace=True)

# Plotting the KDE Plot
sns.kdeplot(iris_df.loc[(iris_df['Target'] == 'Iris_Virginica'), 'Sepal_Length'],
            color='b', shade=True, label='Iris_Virginica')

# Setting the X and Y Label
plt.xlabel('Sepal Length')
plt.ylabel('Probability Density')

OUTPUT:


RESULT: Visualizing the probability distribution of a sample against a single continuous


attribute has been executed successfully.
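As a small extension (assuming the iris_df built in the code above), the same KDE plot can be overlaid for all three species to compare their Sepal_Length distributions:

# Overlay the Sepal_Length distribution of each species on one KDE plot
for species in ['Iris_Setosa', 'Iris_Versicolor', 'Iris_Virginica']:
    sns.kdeplot(iris_df.loc[iris_df['Target'] == species, 'Sepal_Length'],
                shade=True, label=species)
plt.xlabel('Sepal Length')
plt.ylabel('Probability Density')
plt.legend()
plt.show()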

VIVA QUESTIONS & ANSWERS

1. What are the advantages of using Visual Studio Code as an editor for Python and R?
Visual Studio Code offers a lightweight yet powerful environment with features like syntax
highlighting, code completion, debugging, and Git integration. It supports multiple languages,
including Python and R, making it convenient for developers who work with diverse technologies.

2. How can you access and use datasets from Kaggle?
Datasets can be downloaded from Kaggle by creating an account on the website and navigating to the
dataset page. Once downloaded, the data can be imported into the coding environment.

3. Why might you choose Visual Studio Code or PyCharm for coding projects involving
datasets?
Both Visual Studio Code and PyCharm offer robust features for coding, debugging, and project
management. PyCharm, specifically tailored for Python development, provides comprehensive
support for scientific libraries and tools like pandas and numpy. Visual Studio Code, while more
lightweight, offers flexibility and a wide range of extensions, making it suitable for various
programming tasks, including data analysis and visualization.

4. How would you plot a line chart in Python/R to visualize the relationship between two
variables?
Use libraries like Matplotlib in Python or ggplot2 in R to plot a line chart. Specify the
variables for the x-axis and y-axis and customize the appearance of the chart as needed.

5. Describe the process of creating a histogram to examine the frequency distribution of a


variable.
To create a histogram, divide the range of the variable into bins and count the number of


observations falling into each bin. Plot the bins along the x-axis and the frequency or density of
observations in each bin along the y-axis. This visualization helps to understand the distribution and
central tendency of the variable.

6. What steps are involved in importing and preprocessing a dataset in Python using pandas?
First, import the dataset into a pandas DataFrame. Then, identify and drop any irrelevant
columns. Next, clean and transform the data as needed, which may include changing the index,
handling missing values, and formatting fields using regular expressions or other methods.

7. How can regular expressions be used to tidy up fields in a dataset?


Regular expressions are useful for pattern matching and extracting specific information from
strings. They can be applied to fields like dates to standardize the format or extract relevant
components such as year, month, and day.
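For instance, a minimal illustrative snippet (the raw_date value here is just a made-up example) that pulls a four-digit year out of a messy publication date:

import re

raw_date = "1879 [1878]"                 # hypothetical messy date field
match = re.search(r"\d{4}", raw_date)    # first run of exactly four digits
print(match.group() if match else None)  # prints: 1879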

8. What is logistic regression, and how does regularization affect its performance?
Logistic regression is a linear classification algorithm used to predict the probability of a binary outcome
based on one or more predictor variables. Regularization introduces a penalty term to the loss
function, which helps prevent overfitting by penalizing large coefficients. A higher regularization
parameter shrinks the coefficients towards zero, leading to a simpler model with potentially better
generalization.

9. Explain the concept of hyperparameters in machine learning models.


Hyperparameters are parameters that are set before the learning process begins and determine the
behavior of the learning algorithm. They are not learned from the data but rather specified by the
user. Examples include regularization strength, learning rate, and kernel type in SVM.

10. What is the ID3 algorithm, and how does it construct decision trees?
The ID3 (Iterative Dichotomiser 3) algorithm is a classic decision tree algorithm used for
classification. It recursively selects the best attribute to split the data based on information gain or
another criterion and partitions the data accordingly. This process continues until all instances in a
node belong to the same class or other stopping criteria are met.


11. What are some evaluation metrics used to assess the performance of clustering algorithms?
Common evaluation metrics for clustering include the Rand Index, Silhouette Score, and Davies-
Bouldin Index. These metrics assess aspects such as cluster separation, cohesion, and similarity to
ground truth labels.

12. How can web scraping be used to gather data from social media platforms?
Web scraping involves extracting data from websites using automated scripts or tools. Social
media platforms often provide APIs (Application Programming Interfaces) that allow developers to
access data programmatically. Alternatively, web scraping techniques can be employed to extract
information from public profiles or pages.
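As an illustration only (the URL below is a placeholder, and any real target's terms of service and robots.txt must be respected), a minimal scraping sketch with requests and BeautifulSoup:

import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com/public-page")   # placeholder URL
soup = BeautifulSoup(response.text, "html.parser")
for heading in soup.find_all("h2"):                           # list all second-level headings
    print(heading.get_text(strip=True))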

13. What are some ethical considerations when scraping data from social media platforms?
It's essential to respect the terms of service of the platform and obtain consent from users if their
data is being collected. Additionally, be mindful of privacy concerns and avoid scraping sensitive or
personal information without proper authorization.

14. Describe the process of evaluating machine learning models and why it's essential.
Evaluating machine learning models involves assessing their performance on unseen data to
understand how well they generalize to new instances. This process helps determine if the model has
learned meaningful patterns from the training data and can make accurate predictions on real-world
data. Common evaluation techniques include splitting the data into training and testing sets, cross-
validation, and using metrics such as accuracy, precision, recall, and F1-score. Evaluating models
ensures that they meet the desired level of performance and reliability for the intended application.
Additionally, it helps identify areas for improvement and guides the selection of appropriate
algorithms and hyperparameters.

***END***
