DS Manual
Data Science and its applications Lab (21AD62)
Course Outcomes:
After the completion of the course, the students will be able to:
CO1 – Demonstrate proficiency in the statistical analysis of data to derive insights from results and
interpret the findings visually.
CO2 – Apply data-management skills by obtaining, cleaning and transforming data.
CO3 – Apply machine learning models to solve business-related challenges.
CO4 – Execute decision trees, neural network layers and data partitioning.
CO5 – Estimate how social clustering shapes individuals and groups in contemporary society.
7 Train an SVM classifier on the iris dataset using sklearn. Try different kernels CO3 14-15
and the associated hyperparameters. Train a model with the following set of
hyperparameters: RBF kernel, gamma=0.5, one-vs-rest classifier, no feature
normalization. Also try C = 0.01, 1, 10. For the above set of hyperparameters,
find the best classification accuracy along with the total number of
support vectors on the test data.
8 Consider the following dataset. Write a program to demonstrate the working of CO4 16-20
the decision tree based ID3 algorithm.
9 Consider the dataset spiral.txt (https://bit.ly/2Lm75Ly). The first two columns in CO4 21-23
the dataset correspond to the co-ordinates of each data point. The third column
corresponds to the actual cluster label. Compute the Rand index for the following
methods:
K-means clustering
Single-link hierarchical clustering
Complete-link hierarchical clustering
Also visualize the dataset and identify which algorithm is able to recover the true
clusters.
10 Write a program to demonstrate regression analysis with residual plots CO1 24-25
on a given data set.
Steps:
Installation of Python: download the installer from python.org and run it.
Download R: download the installer from cran.r-project.org and run it.
Verify Installation: run python --version and R --version in a terminal to confirm both tools are available.
Once logged in, navigate to the "Datasets" section from the top menu.
Explore the available datasets by browsing or searching for specific topics.
Download Datasets:
In a Python or R script within Visual Studio Code, use libraries such as pandas (for Python) or readr (for
R) to import the downloaded datasets and start analyzing them.
Code:
import pandas as pd

# Confirm pandas is installed and print its version
print(pd.__version__)

# Read the dataset
df = pd.read_csv("desktop/water-quality-1.csv")

# Display the first few rows of the dataset
print(df.head())
Output:
Sample ID Grab ID Profile ID Sample Number Collect DateTime \
0 16316 16316.0 10702 9209019 04/13/1992 12:00:00 AM
1 8937 8937.0 37688 7915489 06/20/1979 12:00:00 AM
2 137745 137745.0 54368 L58228-1 06/25/2013 08:09:00 AM
3 131816 131816.0 50605 L55068-6 02/13/2012 09:38:00 AM
4 82325 82325.0 43896 L52933-87 03/30/2011 02:36:00 PM
[5 rows x 25 columns]
2) Write programs in Python/R and execute them in either Visual Studio Code,
PyCharm Community Edition, or any other suitable environment.
Code:
def factorial(n):
    # Base case: 0! = 1
    if n == 0:
        return 1
    else:
        # Recursive case: n! = n * (n-1)!
        return n * factorial(n - 1)

num = int(input("Enter a number: "))
print("Factorial of", num, "is", factorial(num))
Output:
Enter a number: 5
Factorial of 5 is 120
3) A study was conducted to understand the effect of the number of hours students
spent studying on their performance in the final exams. Write a code to plot a line
chart with the number of hours spent studying on the x-axis and the score in the final
exam on the y-axis. Use a red '*' as the point character, label the axes and give the plot a title.

Number of hrs spent studying (x):   10  9   2   15  10  16  11  16
Score in the final exam, 0-100 (y): 95  80  10  50  45  98  38  93
Code:
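A minimal matplotlib sketch for the data above (variable names are illustrative, not from the original script):

import matplotlib.pyplot as plt

# Data from the study: hours spent studying (x) and final exam score (y)
hours = [10, 9, 2, 15, 10, 16, 11, 16]
scores = [95, 80, 10, 50, 45, 98, 38, 93]

# Line chart with a red '*' point character, labelled axes and a title
plt.plot(hours, scores, color='red', marker='*')
plt.xlabel('Number of hours spent studying')
plt.ylabel('Score in the final exam (0-100)')
plt.title('Study Hours vs. Final Exam Score')
plt.show()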
Output:
Code:
import pandas as pd
import matplotlib.pyplot as plt
# Load the dataset
mtcars = pd.read_csv('desktop/mtcars.csv')
# Plot histogram
plt.hist(mtcars['mpg'], bins=10, color='skyblue', edgecolor='black')
# Add labels and title
plt.xlabel('Miles per gallon')
plt.ylabel('Frequency')
plt.title('Frequency Distribution of Miles per Gallon')
# Show plot
plt.show()
Output:
Steps:
Code:
import pandas as pd
import numpy as np
import re

# Load the book dataset into a DataFrame (the path/filename is assumed here)
books_df = pd.read_csv("BL-Flickr-Images-Book.csv")

# Find and drop the columns which are irrelevant for the book information
columns_to_drop = ['Edition Statement', 'Corporate Author', 'Corporate Contributors', 'Former owner',
                   'Engraver', 'Contributors', 'Issuance type', 'Shelfmarks']
books_df.drop(columns=columns_to_drop, inplace=True)

# Tidy up fields in the data such as date of publication with the help of a simple regular expression
def clean_date(date):
    # Extract the first four-digit year from the string, if any
    if isinstance(date, str):
        match = re.search(r'\d{4}', date)
        if match:
            return match.group()
    return np.nan

books_df['Date of Publication'] = books_df['Date of Publication'].apply(clean_date)
# Normalize the place of publication (the assignment line was lost at a page break and is reconstructed here)
books_df['Place of Publication'] = np.where(
    books_df['Place of Publication'].str.contains('London'),
    'London',
    np.where(
        books_df['Place of Publication'].str.contains('Oxford'),
        'Oxford',
        books_df['Place of Publication'].replace(
            r'^\s*$', 'Unknown', regex=True
        )
    )
)
# Display the cleaned DataFrame
print("\nCleaned DataFrame:")
print(books_df.head())
Output:
Original DataFrame:
Identifier Edition Statement Place of Publication \
0 206 NaN London
1 216 NaN London; Virtue & Yorston
2 218 NaN London
3 472 NaN London
4 480 A new edition, revised, etc. London
Title Author \
0 Walter Forbes. [A novel.] By A. A A. A.
1 All for Greed. [A novel. The dedication signed... A., A. A.
2 Love the Avenger. By the author of “All for Gr... A., A. A.
3 Welsh Sketches, chiefly ecclesiastical, to the... A., E. S.
4 [The World in which I live, and my place in it... A., E. S.
Flickr URL \
0 http://www.flickr.com/photos/britishlibrary/ta...
1 http://www.flickr.com/photos/britishlibrary/ta...
2 http://www.flickr.com/photos/britishlibrary/ta...
3 http://www.flickr.com/photos/britishlibrary/ta...
4 http://www.flickr.com/photos/britishlibrary/ta...
Shelfmarks
0 British Library HMNTS 12641.b.30.
1 British Library HMNTS 12626.cc.2.
2 British Library HMNTS 12625.dd.1.
3 British Library HMNTS 10369.bbb.15.
4 British Library HMNTS 9007.d.28.
Cleaned DataFrame:
Place of Publication Date of Publication Publisher \
Identifier
206 London 1879 S. Tinsley & Co.
216 London 1868 Virtue & Co.
218 London 1869 Bradbury, Evans & Co.
472 London 1851 James Darling
480 London 1857 Wertheim & Macintosh
Flickr URL
Identifier
206 http://www.flickr.com/photos/britishlibrary/ta...
216 http://www.flickr.com/photos/britishlibrary/ta...
218 http://www.flickr.com/photos/britishlibrary/ta...
472 http://www.flickr.com/photos/britishlibrary/ta...
480 http://www.flickr.com/photos/britishlibrary/ta...
Code:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
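Continuing from the imports above, a minimal sketch of the rest of the script, consistent with the reported accuracies (the split size and random_state are assumptions):

# Load the iris dataset and split it into training and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Standardize the features, then fit a logistic regression classifier
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

print("Training Accuracy:", model.score(X_train, y_train))
print("Testing Accuracy:", model.score(X_test, y_test))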
Output:
Training Accuracy: 0.9833333333333333
Testing Accuracy: 1.0
7) Train an SVM classifier on the iris dataset using sklearn. Try different kernels
and the associated hyperparameters. Train a model with the following set of
hyperparameters: RBF kernel, gamma=0.5, one-vs-rest classifier, no feature
normalization. Also try C = 0.01, 1, 10. For the above set of
hyperparameters, find the best classification accuracy along with the total number of
support vectors on the test data.
Code:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
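Continuing from the imports above, a minimal sketch consistent with the reported output (the split size, random_state and use of decision_function_shape='ovr' are assumptions):

# Split the iris data into training and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# RBF kernel, gamma=0.5, one-vs-rest decision function, no feature normalization
for C in [0.01, 1, 10]:
    clf = SVC(kernel='rbf', gamma=0.5, C=C, decision_function_shape='ovr')
    clf.fit(X_train, y_train)
    accuracy = clf.score(X_test, y_test)
    total_sv = clf.n_support_.sum()  # support vectors per class, summed
    print(f"Kernel: rbf, Gamma: 0.5, C: {C}, Accuracy: {accuracy}, "
          f"Total Support Vectors: {total_sv}")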
Output:
Kernel: rbf, Gamma: 0.5, C: 0.01, Accuracy: 0.3, Total Support Vectors: 120
Kernel: rbf, Gamma: 0.5, C: 1, Accuracy: 1.0, Total Support Vectors: 39
Kernel: rbf, Gamma: 0.5, C: 10, Accuracy: 1.0, Total Support Vectors: 31
Code:
import numpy as np

class Node:
    def __init__(self, feature=None, value=None, results=None, true_branch=None,
                 false_branch=None):
        self.feature = feature            # Feature to split on
        self.value = value                # Value of the feature
        self.results = results            # None for internal nodes, class counts for leaf nodes
        self.true_branch = true_branch    # Subtree for when the condition is true
        self.false_branch = false_branch  # Subtree for when the condition is false
def class_counts(rows):
    # Count how many rows belong to each class label (the label is the last column);
    # the first lines of this helper were lost at a page break and are reconstructed here
    counts = {}
    for row in rows:
        label = row[-1]
        counts[label] = counts.get(label, 0) + 1
    return counts

def is_numeric(value):
    return isinstance(value, int) or isinstance(value, float)

def gini(rows):
    # Gini impurity of a set of rows
    counts = class_counts(rows)
    impurity = 1
    for lbl in counts:
        prob_of_lbl = counts[lbl] / float(len(rows))
        impurity -= prob_of_lbl**2
    return impurity
def find_best_split(rows):
    best_gain = 0
    best_feature = None
    current_uncertainty = gini(rows)
    n_features = len(rows[0]) - 1
    # Try every feature/value pair and keep the split with the highest gain
    # (the middle of this function was lost at a page break and is reconstructed here)
    for col in range(n_features):
        values = set(row[col] for row in rows)
        for value in values:
            true_rows, false_rows = split(rows, col, value)
            if len(true_rows) == 0 or len(false_rows) == 0:
                continue
            p = float(len(true_rows)) / len(rows)
            gain = current_uncertainty - p * gini(true_rows) - (1 - p) * gini(false_rows)
            if gain > best_gain:
                best_gain, best_feature = gain, (col, value)
    return best_gain, best_feature

def split(rows, col, value):
    # Partition rows by whether the given column matches the value
    true_rows, false_rows = [], []
    for row in rows:
        if row[col] == value:
            true_rows.append(row)
        else:
            false_rows.append(row)
    return true_rows, false_rows
def build_tree(rows):
    gain, question = find_best_split(rows)
    if gain == 0:
        # No further gain: make a leaf holding the class counts
        return Node(results=class_counts(rows))
    feature, value = question
    true_rows, false_rows = split(rows, feature, value)
    true_branch = build_tree(true_rows)
    false_branch = build_tree(false_rows)
    return Node(feature, value, true_branch=true_branch, false_branch=false_branch)
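A minimal print_tree helper consistent with the output format below (the sample dataset and the original helper did not survive in this copy and are not reproduced):

def print_tree(node, spacing=""):
    # Leaf node: print the predicted class counts
    if node.results is not None:
        print(spacing + "Predict", node.results)
        return
    # Internal node: print the split condition, then recurse into both branches
    print(spacing + str(node.feature) + ":" + str(node.value))
    print(spacing + "--> True:")
    print_tree(node.true_branch, spacing + "  ")
    print(spacing + "--> False:")
    print_tree(node.false_branch, spacing + "  ")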
Output:
Decision Tree:
1:High
--> True:
2:5
--> True:
Predict {'Yes': 2}
--> False:
Predict {'No': 3}
--> False:
3:Yes
--> True:
Predict {'Yes': 4}
--> False:
1:Low
--> True:
Predict {'Yes': 2}
--> False:
Predict {'No': 2}
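Code:
A minimal sketch of the setup this plotting code assumes (the file name, variable names and the choice of sklearn's rand_score are assumptions; the original may have used adjusted_rand_score):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import rand_score

# Load the spiral dataset: columns are x-coordinate, y-coordinate, true label
data = np.loadtxt("spiral.txt")
X, true_labels = data[:, :2], data[:, 2]
n_clusters = len(np.unique(true_labels))

# Fit the three clustering methods
kmeans_labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
single_link_labels = AgglomerativeClustering(
    n_clusters=n_clusters, linkage='single').fit_predict(X)
complete_link_labels = AgglomerativeClustering(
    n_clusters=n_clusters, linkage='complete').fit_predict(X)

# Rand index of each labelling against the ground truth
print("Rand Index for K-means Clustering:", rand_score(true_labels, kmeans_labels))
print("Rand Index for Single-link Hierarchical Clustering:",
      rand_score(true_labels, single_link_labels))
print("Rand Index for Complete-link Hierarchical Clustering:",
      rand_score(true_labels, complete_link_labels))

plt.figure(figsize=(15, 5))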
plt.subplot(1, 3, 1)
plt.scatter(X[:, 0], X[:, 1], c=kmeans_labels, cmap='viridis')
plt.title("K-means Clustering")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.subplot(1, 3, 2)
plt.scatter(X[:, 0], X[:, 1], c=single_link_labels, cmap='viridis')
plt.title("Single-link Hierarchical Clustering")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.subplot(1, 3, 3)
plt.scatter(X[:, 0], X[:, 1], c=complete_link_labels, cmap='viridis')
plt.title("Complete-link Hierarchical Clustering")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()
Output:
Rand Index for K-means Clustering: 0.5474568137016044
Rand Index for Single-link Hierarchical Clustering: 0.1923106542083827
Rand Index for Complete-link Hierarchical Clustering: 0.16319049103784852
10) Write a program to demonstrate regression analysis with residual plots on a given data set.
Code:
import numpy as np
import matplotlib.pyplot as plt

# The middle of estimate_coef and the body of plot_regression_line were lost at a
# page break; they are reconstructed here to match the printed coefficients below.
def estimate_coef(x, y):
    # number of observations/points
    n = np.size(x)
    # means, cross-deviation and deviation about x
    m_x, m_y = np.mean(x), np.mean(y)
    SS_xy = np.sum(y * x) - n * m_y * m_x
    SS_xx = np.sum(x * x) - n * m_x * m_x
    # regression coefficients
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1 * m_x
    return (b_0, b_1)

def plot_regression_line(x, y, b):
    # observed points, fitted line, and dotted residual segments
    plt.scatter(x, y, color="m", marker="o", s=30)
    y_pred = b[0] + b[1] * x
    plt.plot(x, y_pred, color="g")
    plt.vlines(x, y, y_pred, colors="gray", linestyles="dotted")
    # putting labels
    plt.xlabel('x')
    plt.ylabel('y')
    plt.show()

def main():
    # observations or data
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])
    # estimating coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb0 = {} \nb1 = {}".format(b[0], b[1]))
    # plotting the regression line with residuals
    plot_regression_line(x, y, b)

if __name__ == "__main__":
    main()
OUTPUT:
Estimated coefficients:
b0 = 1.2363636363636363
b1 = 1.1696969696969697
RESULT: The computation for simple linear regression was completed successfully.
import pandas as pd
from sklearn.datasets import load_iris

# Only the last line of this program survives in this copy; the preceding lines
# are a reconstruction that assumes the standard iris DataFrame setup
iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df['Target'] = iris.target
print(iris_df.head())
OUTPUT:
1. What are the advantages of using Visual Studio Code as an editor for Python and R?
Visual Studio Code offers a lightweight yet powerful environment with features like syntax
highlighting, code completion, debugging, and Git integration. It supports multiple languages,
including Python and R, making it convenient for developers who work with diverse technologies.
3. Why might you choose Visual Studio Code or PyCharm for coding projects involving
datasets?
Both Visual Studio Code and PyCharm offer robust features for coding, debugging, and project
management. PyCharm, specifically tailored for Python development, provides comprehensive
support for scientific libraries and tools like pandas and numpy. Visual Studio Code, while more
lightweight, offers flexibility and a wide range of extensions, making it suitable for various
programming tasks, including data analysis and visualization.
4. How would you plot a line chart in Python/R to visualize the relationship between two
variables?
Use libraries like Matplotlib in Python or ggplot2 in R to plot a line chart. Specify the
variables for the x-axis and y-axis and customize the appearance of the chart as needed.
5. How would you visualize the frequency distribution of a variable using a histogram?
Divide the range of the variable into intervals (bins) and count the number of
observations falling into each bin. Plot the bins along the x-axis and the frequency or density of
observations in each bin along the y-axis. This visualization helps to understand the distribution and
central tendency of the variable.
6. What steps are involved in importing and preprocessing a dataset in Python using pandas?
First, import the dataset into a pandas DataFrame. Then, identify and drop any irrelevant
columns. Next, clean and transform the data as needed, which may include changing the index,
handling missing values, and formatting fields using regular expressions or other methods.
8. What is logistic regression, and how does regularization affect its performance?
Logistic regression is a linear classification algorithm used to predict the probability of a binary outcome
based on one or more predictor variables. Regularization introduces a penalty term to the loss
function, which helps prevent overfitting by penalizing large coefficients. A higher regularization
parameter shrinks the coefficients towards zero, leading to a simpler model with potentially better
generalization.
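As an illustrative sketch, sklearn's LogisticRegression exposes this through the inverse-regularization parameter C, where smaller C means stronger regularization (the dataset and C values here are arbitrary):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)

# Smaller C = stronger regularization = coefficients shrink towards zero
for C in [0.01, 1, 100]:
    model = make_pipeline(StandardScaler(), LogisticRegression(C=C))
    model.fit(X, y)
    coef = model.named_steps['logisticregression'].coef_
    print(f"C={C}: largest |coefficient| = {abs(coef).max():.3f}")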
10. What is the ID3 algorithm, and how does it construct decision trees?
The ID3 (Iterative Dichotomiser 3) algorithm is a classic decision tree algorithm used for
classification. It recursively selects the best attribute to split the data based on information gain or
another criterion and partitions the data accordingly. This process continues until all instances in a
node belong to the same class or other stopping criteria are met.
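A minimal sketch of the entropy and information-gain computation that ID3 relies on (function names are illustrative):

import math
from collections import Counter

def entropy(labels):
    # H(S) = -sum(p * log2(p)) over the class proportions
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(labels, subsets):
    # Gain = entropy before the split minus the weighted entropy after it
    total = len(labels)
    remainder = sum(len(s) / total * entropy(s) for s in subsets)
    return entropy(labels) - remainder

# Example: splitting ['Yes','Yes','No','No'] into pure subsets gives gain 1.0
print(information_gain(['Yes', 'Yes', 'No', 'No'],
                       [['Yes', 'Yes'], ['No', 'No']]))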
11. What are some evaluation metrics used to assess the performance of clustering algorithms?
Common evaluation metrics for clustering include the Rand Index, Silhouette Score, and Davies-
Bouldin Index. These metrics assess aspects such as cluster separation, cohesion, and similarity to
ground truth labels.
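For instance, scikit-learn (version 0.24 or later for rand_score) exposes these metrics directly; a short sketch with made-up labels:

import numpy as np
from sklearn.metrics import rand_score, adjusted_rand_score, silhouette_score

true_labels = [0, 0, 1, 1, 2, 2]
pred_labels = [0, 0, 1, 2, 2, 2]

# Agreement between a predicted clustering and the ground-truth labels
print("Rand Index:", rand_score(true_labels, pred_labels))
print("Adjusted Rand Index:", adjusted_rand_score(true_labels, pred_labels))

# Silhouette is computed from the data points themselves, not the ground truth
X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 10], [10, 11]])
print("Silhouette Score:", silhouette_score(X, pred_labels))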
12. How can web scraping be used to gather data from social media platforms?
Web scraping involves extracting data from websites using automated scripts or tools. Social
media platforms often provide APIs (Application Programming Interfaces) that allow developers to
access data programmatically. Alternatively, web scraping techniques can be employed to extract
information from public profiles or pages.
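A minimal scraping sketch using requests and BeautifulSoup (the URL and tag names are placeholders; always check the target's terms of service and robots.txt first):

import requests
from bs4 import BeautifulSoup

# Fetch a (hypothetical) public page and parse its HTML
url = "https://example.com/public-profile"
response = requests.get(url, timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

# Extract text from (hypothetical) <h2 class="post-title"> elements
for title in soup.find_all("h2", class_="post-title"):
    print(title.get_text(strip=True))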
13. What are some ethical considerations when scraping data from social media platforms?
It's essential to respect the terms of service of the platform and obtain consent from users if their
data is being collected. Additionally, be mindful of privacy concerns and avoid scraping sensitive or
personal information without proper authorization.
14. Describe the process of evaluating machine learning models and why it's essential.
Evaluating machine learning models involves assessing their performance on unseen data to
understand how well they generalize to new instances. This process helps determine if the model has
learned meaningful patterns from the training data and can make accurate predictions on real-world
data. Common evaluation techniques include splitting the data into training and testing sets, cross-
validation, and using metrics such as accuracy, precision, recall, and F1-score. Evaluating models
ensures that they meet the desired level of performance and reliability for the intended application.
Additionally, it helps identify areas for improvement and guides the selection of appropriate
algorithms and hyperparameters.
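A short sketch of the split-train-evaluate workflow described above (the model and split parameters are arbitrary):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Held-out metrics: accuracy, precision, recall and F1 per class
print(classification_report(y_test, model.predict(X_test)))

# 5-fold cross-validation on the full dataset
print("Cross-validated accuracy:", cross_val_score(model, X, y, cv=5).mean())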
***END***