
Data Science and its applications Lab (21AD62)

DATA SCIENCE AND ITS APPLICATIONS

Subject code: 21AD62 IA Marks: 20


Hour/week: 02 Total Hours: 20

Course Outcomes:
After the completion of the course, the students will be able to:
CO1 – Demonstrate proficiency with statistical analysis of data to derive insights from results and
interpret the data findings visually.
CO2 – Demonstrate skills in data management by obtaining, cleaning and transforming the data.
CO3 – Apply machine learning models to solve business-related challenges.
CO4 – Implement decision trees, neural network layers and data partitioning.
CO5 – Estimate how social clustering shapes individuals and groups in contemporary society.

Sl. No.  List of Experiments (with mapped CO)

1. Installation of Python/R language and the Visual Studio Code editor, demonstrated along with Kaggle dataset usage. (CO1)
2. Write programs in Python/R and execute them in either Visual Studio Code or PyCharm Community Edition or any other suitable environment. (CO1)
3. A study was conducted to understand the effect of the number of hours students spent studying on their performance in the final exams. Write code to plot a line chart with the number of hours spent studying on the x-axis and the score in the final exam on the y-axis. Use a red ‘*’ as the point character, label the axes and give the plot a title. (CO1)
4. For the given dataset mtcars.csv (www.kaggle.com/ruiromanini/mtcars), plot a histogram to check the frequency distribution of the variable ‘mpg’ (miles per gallon). (CO2)
5. Consider the books dataset BL-Flickr-Images-Book.csv from Kaggle (https://www.kaggle.com/adeyoyintemidayo/publication-of-books), which contains information about books. Write a program to demonstrate the following. (CO2)
    Import the data into a DataFrame
    Find and drop the columns which are irrelevant for the book information
    Change the index of the DataFrame
    Tidy up fields in the data, such as date of publication, with the help of a simple regular expression
    Combine str methods with NumPy to clean columns
6. Train a regularized logistic regression classifier on the iris dataset (https://archive.ics.uci.edu/ml/machine-learning-databases/iris/ or the inbuilt iris dataset) using sklearn. Train the model with the hyperparameter C = 1e4 and report the best classification accuracy. (CO3)
7. Train an SVM classifier on the iris dataset using sklearn. Try different kernels and the associated hyperparameters. Train the model with the following set of hyperparameters: RBF kernel, gamma = 0.5, one-vs-rest classifier, no feature normalization. Also try C = 0.01, 1, 10. For the above set of hyperparameters, find the best classification accuracy along with the total number of support vectors on the test data. (CO3)
8. Consider the given dataset. Write a program to demonstrate the working of the decision tree based ID3 algorithm. (CO4)
9. Consider the dataset spiral.txt (https://bit.ly/2Lm75Ly). The first two columns in the dataset correspond to the co-ordinates of each data point. The third column corresponds to the actual cluster label. Compute the Rand index for the following methods: (CO4)
    K-means clustering
    Single-link hierarchical clustering
    Complete-link hierarchical clustering
   Also visualize the dataset and determine which algorithm is able to recover the true clusters.
10. Mini Project – Simple web scraping in social media.

CONTENT BEYOND SYLLABUS

Sl. No.  List of Experiments (with mapped CO)

1. Write a program to demonstrate regression analysis with residual plots on a given data set. (CO1)
2. Write a program to visualize the probability distribution of a sample against a single continuous attribute. (CO3)


1) Installation of Python/R language and the Visual Studio Code editor, demonstrated
along with Kaggle dataset usage

Steps:
Installation of Python:
Download Python

 Go to the official Python website: Python.org.


 Download the latest version of Python for your operating system (Windows, macOS, or Linux).
Install Python:

 Run the downloaded installer.


 Check the box that says "Add Python to PATH" during installation.
 Follow the installation wizard.
Verify Installation:

 Open a terminal or command prompt.


 Type python --version and press Enter to verify that Python is installed.
Installation of R Language:

Download R:

 Go to the official R Project website: R-Project.org.


 Download the appropriate installer for your operating system.
Install R:

 Run the downloaded installer.


 Follow the installation instructions provided by the installer.

Verify Installation:

 Open a terminal or command prompt.


 Type R --version and press Enter to verify that R is installed.

Installation of Visual Studio Code:

Download Visual Studio Code:

 Go to the official Visual Studio Code website: code.visualstudio.com.


 Download the installer for your operating system.
Install Visual Studio Code:

 Run the downloaded installer.


 Follow the installation instructions provided by the installer.


Install Python and R Extensions:

 Open Visual Studio Code after installation.


 Go to the Extensions view by clicking on the square icon on the sidebar or pressing
Ctrl+Shift+X.
 Search for "Python" in the Extensions Marketplace and install the Python extension provided by
Microsoft.
 Search for "R" in the Extensions Marketplace and install the R extension provided by Yuki
Ueda.

Using Kaggle Datasets:

Sign up/Login to Kaggle:

 Go to the Kaggle website: Kaggle.com.


 Sign up for a Kaggle account, or log in to an existing one.
Explore Datasets:

 Once logged in, navigate to the "Datasets" section from the top menu.
 Explore the available datasets by browsing or searching for specific topics.
Download Datasets:

 Click on a particular dataset.


 On the dataset page, find a "Download" button.
 Download the dataset in a format suitable for analysis (usually CSV or other common formats).
Import Datasets in Python or R:

In a Python or R script within Visual Studio Code, use libraries such as pandas (for Python) or readr (for
R) to import the downloaded datasets and start analyzing them.

Import and Analyse a Kaggle Dataset in Python:

In a Python script in Visual Studio Code, import pandas and read the downloaded CSV file:

Code:

import pandas as pd
print(pd.__version__)

# Read the dataset
df = pd.read_csv("desktop/water-quality-1.csv")

# Display the first few rows of the dataset
print(df.head())


 Save your scripts.


 Run each script separately by clicking on the "Run" button in Visual Studio Code or by using the
appropriate keyboard shortcut (usually Ctrl+Enter).

Output:
Sample ID Grab ID Profile ID Sample Number Collect DateTime \
0 16316 16316.0 10702 9209019 04/13/1992 12:00:00 AM
1 8937 8937.0 37688 7915489 06/20/1979 12:00:00 AM
2 137745 137745.0 54368 L58228-1 06/25/2013 08:09:00 AM
3 131816 131816.0 50605 L55068-6 02/13/2012 09:38:00 AM
4 82325 82325.0 43896 L52933-87 03/30/2011 02:36:00 PM

Depth (m) Site Type Area Locator \


0 1.0 Streams and Rivers Pipers KSHZ06
1 1.0 Streams and Rivers Crisp 0321
2 1.0 Large Lakes Lake Union/Ship Canal 0512
3 1.0 Large Lakes Lake Union/Ship Canal 0540
4 4.2 Large Lakes Lake Washington 0804

Site ... MDL RDL \


0 Pipers Creek mouth ... NaN NaN
1 Crisp Creek mouth at SE Green Valley Rd ... NaN NaN
2 Ship Canal above locks ... NaN NaN
3 Ship Canal near Montlake Bridge ... 0.002 0.005
4 Lake Washington north end ... NaN NaN

Text Value Sample Info Steward Note \


0 .070||King County Nstream Database/B53311 NaN NaN
1 .727||King County Nstream Database/RS2 NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN

Replicates Replicate Of Method Date Analyzed Data Source


0 NaN NaN none NaN KCEL
1 NaN NaN NaN NaN KCEL
2 NaN NaN HYDROLAB 06/25/2013 KCEL
3 NaN NaN SM4500-P-F 02/15/2012 KCEL
4 NaN NaN HYDROLAB NaN KCEL

[5 rows x 25 columns]


2) Write programs in Python/R and execute them in either Visual Studio Code or
PyCharm Community Edition or any other suitable environment.

Example: Write Python code to find factorial of a number

Code:
def factorial(n):
    if n == 0:
        return 1
    else:
        return n * factorial(n-1)

num = int(input("Enter a number: "))
print("Factorial of", num, "is", factorial(num))

Output:
Enter a number: 5
Factorial of 5 is 120

Enter a number: 500


Factorial of 500 is
1220136825991110068701238785423046926253574342803192842192413588385845373153881
9976054964475022032818630136164771482035841633787220781772004807852051593292854
7790757193933060377296085908627042917454788242491272634430567017327076946106280
2310452644218878789465754777149863494367781037644274033827365397471386477878495
4384895955375379904232410612713269843277457155463099772027810145610811883737095
3101635632443298702956389662891165897476957208792692887128178007026517450776841
0719624390394322536422605234945850129918571501248706961568141625359056693423813
0088562492468915641267756544818865065938479517753608940057452389403357984763639
4490531306232374906644504882466507594673586207463792518420045936969298102226397
1952597190945217823331756934581508552332820762820023402626907898342451712006207
7146409794561161276291459512372299133401695523638509428855920187274337951730145
8635757082835578015873543276888868012039988238470215146760544540766353598417443
0480128938313896881639487469658817504506926365338175055478128640000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000
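Note that the recursive version relies on Python's call stack, which is capped by the interpreter's recursion limit (roughly 1000 frames by default), so inputs much larger than 500 would raise a RecursionError. A small iterative sketch that avoids this limit (the function name factorial_iterative is just illustrative):

import math

def factorial_iterative(n):
    # Multiply 1..n in a loop instead of recursing, so no recursion-depth limit applies
    result = 1
    for i in range(2, n + 1):
        result *= i
    return result

num = int(input("Enter a number: "))
print("Factorial of", num, "is", factorial_iterative(num))
# The standard library's math.factorial(num) returns the same value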


3) A study was conducted to understand the effect of the number of hours students
spent studying on their performance in the final exams. Write code to plot a line
chart with the number of hours spent studying on the x-axis and the score in the final
exam on the y-axis. Use a red ‘*’ as the point character, label the axes and give the plot a title.

Number of hrs spent studying (x):     10  9   2   15  10  16  11  16
Score in the final exam (0–100) (y):  95  80  10  50  45  98  38  93

Code:

import matplotlib.pyplot as plt


# Data
hours_studied = [10, 9, 2, 15, 10, 16, 11, 16]
exam_scores = [95, 80, 10, 50, 45, 98, 38, 93]
# Plotting
plt.plot(hours_studied, exam_scores, marker='*', color='red', linestyle='-')
plt.xlabel('Number of hrs spent studying')
plt.ylabel('Score in the final exam (0 - 100)')
plt.title('Relationship between Hours Studied and Exam Scores')
plt.grid(True)
# Show plot
plt.show()

Output:


4) For the given dataset mtcars.csv (www.kaggle.com/ruiromanini/mtcars), plot a


histogram to check the frequency distribution of the variable ‘mpg’ (Miles per
gallon)

Code:

import pandas as pd
import matplotlib.pyplot as plt
# Load the dataset
mtcars = pd.read_csv('desktop/mtcars.csv')
# Plot histogram
plt.hist(mtcars['mpg'], bins=10, color='skyblue', edgecolor='black')
# Add labels and title
plt.xlabel('Miles per gallon')
plt.ylabel('Frequency')
plt.title('Frequency Distribution of Miles per Gallon')
# Show plot
plt.show()
Output:
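Beyond the plot itself, a quick numeric summary can cross-check what the histogram shows; a small optional addition, assuming the mtcars DataFrame loaded above:

# Summary statistics of 'mpg' to complement the histogram
print(mtcars['mpg'].describe())   # count, mean, std, min, quartiles, max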


5) Consider the books dataset BL-Flickr-Images-Book.csv from Kaggle
(https://www.kaggle.com/adeyoyintemidayo/publication-of-books), which contains
information about books. Write a program to demonstrate the following.

Steps:

 Import the data into a DataFrame


 Find and drop the columns which are irrelevant for the book information.
 Change the Index of the DataFrame
 Tidy up fields in the data such as date of publication with the help of simple regular expression.
 Combine str methods with NumPy to clean columns

Code:
import pandas as pd
import numpy as np
import re

# Import the data into a DataFrame


books_df = pd.read_csv('desktop/BL-Flickr-Images-Book.csv')

# Display the first few rows of the DataFrame


print("Original DataFrame:")
print(books_df.head())

# Find and drop the columns which are irrelevant for the book information
columns_to_drop = ['Edition Statement', 'Corporate Author', 'Corporate Contributors',
                   'Former owner', 'Engraver', 'Contributors', 'Issuance type', 'Shelfmarks']
books_df.drop(columns=columns_to_drop, inplace=True)

# Change the Index of the DataFrame


books_df.set_index('Identifier', inplace=True)

# Tidy up fields in the data, such as date of publication, with the help of a simple regular expression
def clean_date(date):
    if isinstance(date, str):
        match = re.search(r'\d{4}', date)
        if match:
            return match.group()
    return np.nan

books_df['Date of Publication'] = books_df['Date of Publication'].apply(clean_date)

# Combine str methods with NumPy to clean columns


books_df['Place of Publication'] = np.where(
    books_df['Place of Publication'].str.contains('London'),
    'London',
    np.where(
        books_df['Place of Publication'].str.contains('Oxford'),
        'Oxford',
        books_df['Place of Publication'].replace(r'^\s*$', 'Unknown', regex=True)
    )
)
# Display the cleaned DataFrame
print("\nCleaned DataFrame:")
print(books_df.head())

Output:
Original DataFrame:
Identifier Edition Statement Place of Publication \
0 206 NaN London
1 216 NaN London; Virtue & Yorston
2 218 NaN London
3 472 NaN London
4 480 A new edition, revised, etc. London

Date of Publication Publisher \


0 1879 [1878] S. Tinsley & Co.
1 1868 Virtue & Co.
2 1869 Bradbury, Evans & Co.
3 1851 James Darling
4 1857 Wertheim & Macintosh

Title Author \
0 Walter Forbes. [A novel.] By A. A A. A.
1 All for Greed. [A novel. The dedication signed... A., A. A.
2 Love the Avenger. By the author of “All for Gr... A., A. A.
3 Welsh Sketches, chiefly ecclesiastical, to the... A., E. S.
4 [The World in which I live, and my place in it... A., E. S.

Contributors Corporate Author \


0 FORBES, Walter. NaN
1 BLAZE DE BURY, Marie Pauline Rose - Baroness NaN
2 BLAZE DE BURY, Marie Pauline Rose - Baroness NaN
3 Appleyard, Ernest Silvanus. NaN
4 BROOME, John Henry. NaN


Corporate Contributors Former owner Engraver Issuance type \


0 NaN NaN NaN monographic
1 NaN NaN NaN monographic
2 NaN NaN NaN monographic
3 NaN NaN NaN monographic
4 NaN NaN NaN monographic

Flickr URL \
0 http://www.flickr.com/photos/britishlibrary/ta...
1 http://www.flickr.com/photos/britishlibrary/ta...
2 http://www.flickr.com/photos/britishlibrary/ta...
3 http://www.flickr.com/photos/britishlibrary/ta...
4 http://www.flickr.com/photos/britishlibrary/ta...

Shelfmarks
0 British Library HMNTS 12641.b.30.
1 British Library HMNTS 12626.cc.2.
2 British Library HMNTS 12625.dd.1.
3 British Library HMNTS 10369.bbb.15.
4 British Library HMNTS 9007.d.28.

Cleaned DataFrame:
Place of Publication Date of Publication Publisher \
Identifier
206 London 1879 S. Tinsley & Co.
216 London 1868 Virtue & Co.
218 London 1869 Bradbury, Evans & Co.
472 London 1851 James Darling
480 London 1857 Wertheim & Macintosh

Title Author Identifier


206 Walter Forbes. [A novel.] By A. A A. A.
216 All for Greed. [A novel. The dedication signed... A., A. A.
218 Love the Avenger. By the author of “All for Gr... A., A. A.
472 Welsh Sketches, chiefly ecclesiastical, to the... A., E. S.
480 [The World in which I live, and my place in it... A., E. S.

Flickr URL
Identifier
206 http://www.flickr.com/photos/britishlibrary/ta...
216 http://www.flickr.com/photos/britishlibrary/ta...
218 http://www.flickr.com/photos/britishlibrary/ta...
472 http://www.flickr.com/photos/britishlibrary/ta...
480 http://www.flickr.com/photos/britishlibrary/ta...
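As an optional sanity check on the cleaning steps (assuming the books_df produced by the code above), the cleaned columns can be inspected directly:

# Rows where no four-digit year could be extracted
print(books_df['Date of Publication'].isnull().sum())

# Convert the extracted years to numbers and summarise them
print(pd.to_numeric(books_df['Date of Publication']).describe())

# Most frequent places of publication after cleaning
print(books_df['Place of Publication'].value_counts().head())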


6) Train a regularized logistic regression classifier on the iris dataset
(https://archive.ics.uci.edu/ml/machine-learning-databases/iris/ or the inbuilt iris
dataset) using sklearn. Train the model with the following hyperparameter C = 1e4
and report the best classification accuracy.

Code:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Load the Iris dataset


iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a pipeline with standardization and logistic regression


model = make_pipeline(StandardScaler(), LogisticRegression(C=1e4))

# Train the model


model.fit(X_train, y_train)

# Report the training accuracy


training_accuracy = model.score(X_train, y_train)
print(f"Training Accuracy: {training_accuracy}")

# Report the testing accuracy


testing_accuracy = model.score(X_test, y_test)
print(f"Testing Accuracy: {testing_accuracy}")

Output:
Training Accuracy: 0.9833333333333333
Testing Accuracy: 1.0
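Because the accuracies above depend on one particular train/test split, a cross-validated estimate gives a more stable figure for the "best" accuracy. A brief optional sketch, using the same pipeline as above:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(C=1e4, max_iter=1000))

scores = cross_val_score(model, X, y, cv=5)   # 5-fold cross-validation
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())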


7) Train an SVM classifier on the iris dataset using sklearn. Try different kernels
and the associated hyperparameters. Train the model with the following set of
hyperparameters: RBF kernel, gamma = 0.5, one-vs-rest classifier, no feature
normalization. Also try C = 0.01, 1, 10. For the above set of hyperparameters, find
the best classification accuracy along with the total number of support vectors on
the test data.

Code:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Load the Iris dataset


iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the hyperparameters


kernels = ['rbf']
gammas = [0.5]
Cs = [0.01, 1, 10]

# Initialize variables to store best accuracy and support vectors


best_accuracy = 0
best_support_vectors = None

# Iterate over different hyperparameters
for kernel in kernels:
    for gamma in gammas:
        for C in Cs:
            # Create the SVM classifier with the specified hyperparameters
            model = SVC(kernel=kernel, gamma=gamma, C=C, decision_function_shape='ovr')

            # Train the model
            model.fit(X_train, y_train)

            # Calculate accuracy on the test data
            accuracy = model.score(X_test, y_test)

            # Get the total number of support vectors
            total_support_vectors = np.sum(model.n_support_)

            # Update best accuracy and support vectors if current accuracy is better
            if accuracy > best_accuracy:
                best_accuracy = accuracy
                best_support_vectors = total_support_vectors

            # Print the current hyperparameters and corresponding accuracy
            print(f"Kernel: {kernel}, Gamma: {gamma}, C: {C}, Accuracy: {accuracy}, "
                  f"Total Support Vectors: {total_support_vectors}")

# Print the best accuracy and total number of support vectors
print(f"\nBest Accuracy: {best_accuracy}")
print(f"Total Support Vectors for Best Model: {best_support_vectors}")

Output:
Kernel: rbf, Gamma: 0.5, C: 0.01, Accuracy: 0.3, Total Support Vectors: 120
Kernel: rbf, Gamma: 0.5, C: 1, Accuracy: 1.0, Total Support Vectors: 39
Kernel: rbf, Gamma: 0.5, C: 10, Accuracy: 1.0, Total Support Vectors: 31

Best Accuracy: 1.0


Total Support Vectors for Best Model: 39
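The manual loop above can also be expressed with scikit-learn's GridSearchCV, which cross-validates each hyperparameter combination before the test accuracy is measured. A hedged alternative sketch with the same kernel, gamma and C values:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

param_grid = {'kernel': ['rbf'], 'gamma': [0.5], 'C': [0.01, 1, 10]}
grid = GridSearchCV(SVC(decision_function_shape='ovr'), param_grid, cv=5)
grid.fit(X_train, y_train)

print("Best hyperparameters:", grid.best_params_)
print("Test accuracy of best model:", grid.best_estimator_.score(X_test, y_test))
print("Support vectors in best model:", grid.best_estimator_.n_support_.sum())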


8) Consider the following dataset. Write a program to demonstrate the working of


the decision tree based ID3 algorithm.

Code:
import numpy as np

class Node:
    def __init__(self, feature=None, value=None, results=None, true_branch=None, false_branch=None):
        self.feature = feature            # Feature to split on
        self.value = value                # Value of the feature
        self.results = results            # None for internal nodes, holds class counts for leaf nodes
        self.true_branch = true_branch    # Subtree for when the condition is true
        self.false_branch = false_branch  # Subtree for when the condition is false

def unique_vals(rows, col):
    return set([row[col] for row in rows])

def class_counts(rows):
    # A dictionary of label -> count.
    counts = {}
    for row in rows:
        # In our dataset format, the label is always the last column
        label = row[-1]
        if label not in counts:
            counts[label] = 0
        counts[label] += 1
    return counts

def is_numeric(value):
    return isinstance(value, int) or isinstance(value, float)

def gini(rows):
    counts = class_counts(rows)
    impurity = 1
    for lbl in counts:
        prob_of_lbl = counts[lbl] / float(len(rows))
        impurity -= prob_of_lbl**2
    return impurity

def info_gain(left, right, current_uncertainty):
    p = float(len(left)) / (len(left) + len(right))
    return current_uncertainty - p * gini(left) - (1 - p) * gini(right)

def find_best_split(rows):
    best_gain = 0
    best_feature = None
    current_uncertainty = gini(rows)
    n_features = len(rows[0]) - 1
    for col in range(n_features):                 # For each feature
        values = set([row[col] for row in rows])  # Unique values in the column
        for val in values:                        # For each value
            true_rows, false_rows = split(rows, col, val)
            if len(true_rows) == 0 or len(false_rows) == 0:
                continue
            gain = info_gain(true_rows, false_rows, current_uncertainty)
            if gain >= best_gain:
                best_gain, best_feature = gain, (col, val)
    return best_gain, best_feature

def split(rows, col, value):
    true_rows, false_rows = [], []
    for row in rows:
        if row[col] == value:
            true_rows.append(row)
        else:
            false_rows.append(row)
    return true_rows, false_rows

def build_tree(rows):
    gain, (feature, value) = find_best_split(rows)
    if gain == 0:
        return Node(results=class_counts(rows))
    true_rows, false_rows = split(rows, feature, value)
    true_branch = build_tree(true_rows)
    false_branch = build_tree(false_rows)
    return Node(feature, value, true_branch=true_branch, false_branch=false_branch)

def print_tree(node, spacing=""):
    if node.results is not None:  # Leaf node
        print(spacing + "Predict", node.results)
        return
    print(spacing + str(node.feature) + ':' + str(node.value))
    print(spacing + '--> True:')
    print_tree(node.true_branch, spacing + " ")
    print(spacing + '--> False:')
    print_tree(node.false_branch, spacing + " ")

def classify(row, node):
    if node.results is not None:
        return node.results
    if is_numeric(row[node.feature]):
        if row[node.feature] >= node.value:
            return classify(row, node.true_branch)
        else:
            return classify(row, node.false_branch)
    else:
        if row[node.feature] == node.value:
            return classify(row, node.true_branch)
        else:
            return classify(row, node.false_branch)

# Define the dataset
dataset = [
    ['Low', 'Low', 2, 'No', 'Yes'],
    ['Low', 'Med', 4, 'Yes', 'Yes'],
    ['Low', 'Low', 4, 'No', 'Yes'],
    ['Low', 'Med', 4, 'No', 'No'],
    ['Low', 'High', 4, 'No', 'No'],
    ['Med', 'Med', 4, 'No', 'No'],
    ['Med', 'Med', 4, 'Yes', 'Yes'],
    ['Med', 'High', 2, 'Yes', 'No'],
    ['Med', 'High', 5, 'No', 'Yes'],
    ['High', 'Med', 4, 'Yes', 'Yes'],
    ['High', 'Med', 2, 'Yes', 'Yes'],
    ['High', 'High', 2, 'Yes', 'No'],
    ['High', 'High', 5, 'Yes', 'Yes']
]

# Build the tree
tree = build_tree(dataset)

# Print the tree
print("Decision Tree:")
print_tree(tree)

# Test the tree
test_data = ['Med', 'Low', 4, 'No']  # Test data instance
print("\nClassifying Test Data:")
print("Test Data:", test_data)
print("Classification:", classify(test_data, tree))


Output:
Decision Tree:
1:High
--> True:
2:5
--> True:
Predict {'Yes': 2}
--> False:
Predict {'No': 3}
--> False:
3:Yes
--> True:
Predict {'Yes': 4}
--> False:
1:Low
--> True:
Predict {'Yes': 2}
--> False:
Predict {'No': 2}

Classifying Test Data:


Test Data: ['Med', 'Low', 4, 'No']
Classification: {'Yes': 2}
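Note that the splitting criterion in the program above is Gini impurity (as used by CART); classic ID3 selects attributes by entropy-based information gain. A drop-in sketch of that criterion, assuming the class_counts() helper defined above:

import math

def entropy(rows):
    # Shannon entropy of the class labels in rows
    counts = class_counts(rows)
    total = float(len(rows))
    ent = 0.0
    for lbl in counts:
        p = counts[lbl] / total
        ent -= p * math.log2(p)
    return ent

def info_gain_entropy(left, right, current_entropy):
    # Information gain = parent entropy minus the weighted entropy of the children
    p = float(len(left)) / (len(left) + len(right))
    return current_entropy - p * entropy(left) - (1 - p) * entropy(right)

# Replacing gini()/info_gain() with entropy()/info_gain_entropy() inside
# find_best_split() makes the tree split on ID3-style information gain instead.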


9) Consider the dataset spiral.txt (https://bit.ly/2Lm75Ly). The first two columns in
the dataset correspond to the co-ordinates of each data point. The third column
corresponds to the actual cluster label. Compute the Rand index for the following
methods:
 K-means clustering
 Single-link hierarchical clustering
 Complete-link hierarchical clustering
Also visualize the dataset and determine which algorithm is able to recover the true
clusters.
Code:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score
from scipy.cluster.hierarchy import dendrogram
# Load the dataset
data = np.loadtxt('spiral.txt')

# Extract features and true labels


X = data[:, :2]
true_labels = data[:, 2]

# Visualize the dataset


plt.scatter(X[:, 0], X[:, 1], c=true_labels, cmap='viridis')
plt.title("True Clusters")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()
# Compute Rand Index for K-means Clustering
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans_labels = kmeans.fit_predict(X)
rand_index_kmeans = adjusted_rand_score(true_labels, kmeans_labels)
print("Rand Index for K-means Clustering:", rand_index_kmeans)

# Compute Rand Index for Single-link Hierarchical Clustering
single_link = AgglomerativeClustering(n_clusters=3, linkage='single')
single_link_labels = single_link.fit_predict(X)
rand_index_single_link = adjusted_rand_score(true_labels, single_link_labels)
print("Rand Index for Single-link Hierarchical Clustering:", rand_index_single_link)

# Compute Rand Index for Complete-link Hierarchical Clustering


complete_link = AgglomerativeClustering(n_clusters=3, linkage='complete')
complete_link_labels = complete_link.fit_predict(X)
rand_index_complete_link = adjusted_rand_score(true_labels, complete_link_labels)
print("Rand Index for Complete-link Hierarchical Clustering:", rand_index_complete_link)

# Visualize the clusters from different algorithms


plt.figure(figsize=(12, 4))

plt.subplot(1, 3, 1)
plt.scatter(X[:, 0], X[:, 1], c=kmeans_labels, cmap='viridis')
plt.title("K-means Clustering")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")

plt.subplot(1, 3, 2)
plt.scatter(X[:, 0], X[:, 1], c=single_link_labels, cmap='viridis')
plt.title("Single-link Hierarchical Clustering")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")

plt.subplot(1, 3, 3)
plt.scatter(X[:, 0], X[:, 1], c=complete_link_labels, cmap='viridis')
plt.title("Complete-link Hierarchical Clustering")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")

plt.show()


Output:
Rand Index for K-means Clustering: 0.5474568137016044
Rand Index for Single-link Hierarchical Clustering: 0.1923106542083827
Rand Index for Complete-link Hierarchical Clustering: 0.16319049103784852
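Strictly speaking, adjusted_rand_score reports the adjusted Rand index; the plain Rand index asked for in the exercise is available as sklearn.metrics.rand_score in scikit-learn 0.24 and later. A short optional addition, assuming the label arrays computed above:

from sklearn.metrics import rand_score

# Unadjusted Rand index for each clustering (uses the label arrays from the code above)
print("Rand Index for K-means Clustering:", rand_score(true_labels, kmeans_labels))
print("Rand Index for Single-link Hierarchical Clustering:", rand_score(true_labels, single_link_labels))
print("Rand Index for Complete-link Hierarchical Clustering:", rand_score(true_labels, complete_link_labels))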


Content Beyond Syllabus

1. Write a program to demonstrate Regression analysis with residual plots on a given data set.

Code:
import numpy as np
import matplotlib.pyplot as plt

def estimate_coef(x, y):
    # number of observations/points
    n = np.size(x)

    # mean of x and y vector
    mx = np.mean(x)
    my = np.mean(y)

    # calculating cross-deviation and deviation about x
    sxy = np.sum(y*x) - n*my*mx
    sxx = np.sum(x*x) - n*mx*mx

    # calculating regression coefficients
    b1 = sxy / sxx
    b0 = my - b1*mx
    return (b0, b1)

def plot_regression_line(x, y, b):
    # plotting the actual points as scatter plot
    plt.scatter(x, y, color="m", marker="o", s=30)

    # predicted response vector
    ypred = b[0] + b[1]*x

    # plotting the regression line
    plt.plot(x, ypred, color="g")

    # putting labels
    plt.xlabel('x')
    plt.ylabel('y')

    # function to show plot
    plt.show()

def main():
    # observations or data
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])

    # estimating coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb0 = {} \nb1 = {}".format(b[0], b[1]))

    # plotting regression line
    plot_regression_line(x, y, b)

if __name__ == "__main__":
    main()

OUTPUT:
Estimated coefficients:
b0 = 1.2363636363636363
b1 = 1.1696969696969697

RESULT: The computation for Simple Linear Regression was successfully completed
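The experiment title also calls for residual plots, which the listing above stops just short of producing. A minimal sketch, assuming the same x, y data and the coefficients reported above:

import numpy as np
import matplotlib.pyplot as plt

x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])

b0, b1 = 1.2363636363636363, 1.1696969696969697   # coefficients estimated above
residuals = y - (b0 + b1 * x)                      # observed minus fitted values

plt.scatter(x, residuals, color="m")
plt.axhline(y=0, color="g", linestyle="--")        # residuals should scatter randomly around zero
plt.xlabel("x")
plt.ylabel("Residual")
plt.title("Residual Plot")
plt.show()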


2. Write a program to visualize the probability distribution of a sample against a single


continuous attribute.
Code:
# importing the required libraries
from sklearn import datasets
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Setting up the Data Frame
iris = datasets.load_iris()
iris_df = pd.DataFrame(iris.data, columns=['Sepal_Length', 'Sepal_Width',
                                           'Petal_Length', 'Petal_Width'])
iris_df['Target'] = iris.target
iris_df['Target'].replace([0], 'Iris_Setosa', inplace=True)
iris_df['Target'].replace([1], 'Iris_Versicolor', inplace=True)
iris_df['Target'].replace([2], 'Iris_Virginica', inplace=True)

# Plotting the KDE Plot
sns.kdeplot(iris_df.loc[(iris_df['Target'] == 'Iris_Virginica'), 'Sepal_Length'],
            color='b', shade=True, label='Iris_Virginica')

# Setting the X and Y Label
plt.xlabel('Sepal Length')
plt.ylabel('Probability Density')

OUTPUT:


RESULT: Visualizing the probability distribution of a sample against a single continuous


attribute has been executed successfully.
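As a small extension (assuming the iris_df built in the code above), the same KDE plot can be overlaid for all three species to compare their Sepal_Length distributions:

# Overlay the Sepal_Length distribution of each species on one KDE plot
for species in ['Iris_Setosa', 'Iris_Versicolor', 'Iris_Virginica']:
    sns.kdeplot(iris_df.loc[iris_df['Target'] == species, 'Sepal_Length'],
                shade=True, label=species)
plt.xlabel('Sepal Length')
plt.ylabel('Probability Density')
plt.legend()
plt.show()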

VIVA QUESTIONS & ANSWERS

1. What are the advantages of using Visual Studio Code as an editor for Python and R?
Visual Studio Code offers a lightweight yet powerful environment with features like syntax
highlighting, code completion, debugging, and Git integration. It supports multiple languages,
including Python and R, making it convenient for developers who work with diverse technologies.

2. How can you access and use datasets from Kaggle?
Datasets can be downloaded from Kaggle by creating an account on the website and navigating to the
dataset page. Once downloaded, the data can be imported into the coding environment.

3. Why might you choose Visual Studio Code or PyCharm for coding projects involving
datasets?
Both Visual Studio Code and PyCharm offer robust features for coding, debugging, and project
management. PyCharm, specifically tailored for Python development, provides comprehensive
support for scientific libraries and tools like pandas and numpy. Visual Studio Code, while more
lightweight, offers flexibility and a wide range of extensions, making it suitable for various
programming tasks, including data analysis and visualization.

4. How would you plot a line chart in Python/R to visualize the relationship between two
variables?
Use libraries like Matplotlib in Python or ggplot2 in R to plot a line chart. Specify the
variables for the x-axis and y-axis and customize the appearance of the chart as needed.

5. Describe the process of creating a histogram to examine the frequency distribution of a


variable.
To create a histogram, divide the range of the variable into bins and count the number of


observations falling into each bin. Plot the bins along the x-axis and the frequency or density of
observations in each bin along the y-axis. This visualization helps to understand the distribution and
central tendency of the variable.

6. What steps are involved in importing and preprocessing a dataset in Python using pandas?
First, import the dataset into a pandas DataFrame. Then, identify and drop any irrelevant
columns. Next, clean and transform the data as needed, which may include changing the index,
handling missing values, and formatting fields using regular expressions or other methods.

7. How can regular expressions be used to tidy up fields in a dataset?


Regular expressions are useful for pattern matching and extracting specific information from
strings. They can be applied to fields like dates to standardize the format or extract relevant
components such as year, month, and day.
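For instance, a minimal illustrative snippet (the raw_date value here is just a made-up example) that pulls a four-digit year out of a messy publication date:

import re

raw_date = "1879 [1878]"                 # hypothetical messy date field
match = re.search(r"\d{4}", raw_date)    # first run of exactly four digits
print(match.group() if match else None)  # prints: 1879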

8. What is logistic regression, and how does regularization affect its performance?
Logistic regression is a linear classification algorithm used to predict the probability of a binary outcome
based on one or more predictor variables. Regularization introduces a penalty term to the loss
function, which helps prevent overfitting by penalizing large coefficients. A higher regularization
parameter shrinks the coefficients towards zero, leading to a simpler model with potentially better
generalization.

9. Explain the concept of hyperparameters in machine learning models.


Hyperparameters are parameters that are set before the learning process begins and determine the
behavior of the learning algorithm. They are not learned from the data but rather specified by the
user. Examples include regularization strength, learning rate, and kernel type in SVM.

10. What is the ID3 algorithm, and how does it construct decision trees?
The ID3 (Iterative Dichotomiser 3) algorithm is a classic decision tree algorithm used for
classification. It recursively selects the best attribute to split the data based on information gain or
another criterion and partitions the data accordingly. This process continues until all instances in a
node belong to the same class or other stopping criteria are met.


11. What are some evaluation metrics used to assess the performance of clustering algorithms?
Common evaluation metrics for clustering include the Rand Index, Silhouette Score, and Davies-
Bouldin Index. These metrics assess aspects such as cluster separation, cohesion, and similarity to
ground truth labels.

12. How can web scraping be used to gather data from social media platforms?
Web scraping involves extracting data from websites using automated scripts or tools. Social
media platforms often provide APIs (Application Programming Interfaces) that allow developers to
access data programmatically. Alternatively, web scraping techniques can be employed to extract
information from public profiles or pages.
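As an illustration only (the URL below is a placeholder, and any real target's terms of service and robots.txt must be respected), a minimal scraping sketch with requests and BeautifulSoup:

import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com/public-page")   # placeholder URL
soup = BeautifulSoup(response.text, "html.parser")
for heading in soup.find_all("h2"):                           # list all second-level headings
    print(heading.get_text(strip=True))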

13. What are some ethical considerations when scraping data from social media platforms?
It's essential to respect the terms of service of the platform and obtain consent from users if their
data is being collected. Additionally, be mindful of privacy concerns and avoid scraping sensitive or
personal information without proper authorization.

14. Describe the process of evaluating machine learning models and why it's essential.
Evaluating machine learning models involves assessing their performance on unseen data to
understand how well they generalize to new instances. This process helps determine if the model has
learned meaningful patterns from the training data and can make accurate predictions on real-world
data. Common evaluation techniques include splitting the data into training and testing sets, cross-
validation, and using metrics such as accuracy, precision, recall, and F1-score. Evaluating models
ensures that they meet the desired level of performance and reliability for the intended application.
Additionally, it helps identify areas for improvement and guides the selection of appropriate
algorithms and hyperparameters.

***END***
