
CMR INSTITUTE OF TECHNOLOGY

Affiliated to VTU, Approved by AICTE, Accredited by NBA and NAAC with “A++” Grade
ITPL MAIN ROAD, BROOKFIELD, BENGALURU-560037, KARNATAKA, INDIA

Department of Artificial Intelligence and Data Science

LAB MANUAL
DATA SCIENCE AND APPLICATIONS LABORATORY
(Effective from the academic year 2023-2024)

Course Code: 21AD62



TABLE OF CONTENTS

1. Installation of Python/R language and the Visual Studio Code editor, demonstrated along with Kaggle dataset usage.
2. Write programs in Python/R and execute them in Visual Studio Code, PyCharm Community Edition, or any other suitable environment.
3. A study was conducted to understand the effect of the number of hours students spent studying on their performance in the final exams. Write code to plot a line chart with the number of hours spent studying on the x-axis and the score in the final exam on the y-axis. Use a red '*' as the point character, label the axes, and give the plot a title.
4. For the given dataset mtcars.csv (www.kaggle.com/ruiromanini/mtcars), plot a histogram to check the frequency distribution of the variable 'mpg' (miles per gallon).
5. Consider the books dataset BL-Flickr-Images-Book.csv from Kaggle (https://www.kaggle.com/adeyoyintemidayo/publication-of-books), which contains information about books. Write a program to demonstrate the following:
   • Import the data into a DataFrame
   • Find and drop the columns which are irrelevant for the book information
   • Change the index of the DataFrame
   • Tidy up fields in the data, such as date of publication, with the help of a simple regular expression
   • Combine str methods with NumPy to clean columns
6. Train a regularized logistic regression classifier on the iris dataset (https://archive.ics.uci.edu/ml/machine-learning-databases/iris/ or the inbuilt iris dataset) using sklearn. Train the model with the hyperparameter C = 1e4 and report the best classification accuracy.
7. Train an SVM classifier on the iris dataset using sklearn. Try different kernels and the associated hyperparameters. Train the model with the following set of hyperparameters: RBF kernel, gamma = 0.5, one-vs-rest classifier, no feature normalization. Also try C = 0.01, 1, 10. For the above set of hyperparameters, find the best classification accuracy along with the total number of support vectors on the test data.
8. Consider the following dataset. Write a program to demonstrate the working of the decision-tree-based ID3 algorithm.
9. Consider the dataset spiral.txt (https://bit.ly/2Lm75Ly). The first two columns in the dataset correspond to the coordinates of each data point. The third column corresponds to the actual cluster label. Compute the Rand index for the following methods:
   • K-means clustering
   • Single-link hierarchical clustering
   • Complete-link hierarchical clustering
   • Also visualize the dataset and determine which algorithm is able to recover the true clusters.
10. Mini Project – Simple web scraping on social media

Course outcomes (Course Skill Set):

At the end of the course, the student will be able to:

CO 1. Identify and demonstrate data using visualization tools.
CO 2. Make use of statistical hypothesis tests to choose the properties of data, and curate and manipulate data.
CO 3. Utilize the skills of machine learning algorithms and techniques and develop models.
CO 4. Demonstrate the construction of decision trees and data partitioning using clustering.
CO 5. Experiment with social network analysis and make use of natural language processing skills to develop data-driven applications.

CO-PO and CO-PSO Mapping

All five course outcomes (stated above) are taught at Blooms level L3, each covering the correspondingly numbered module. The mapping strengths are:

CO  | Blooms Level | Module covered | PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12 | PSO1 PSO2 PSO3 PSO4
CO1 | L3           | 1              | 3   3   3   3   3   -   -   -   -   -    2    3    | 2    -    3    3
CO2 | L3           | 2              | 3   3   3   3   3   -   -   -   -   -    2    3    | 2    -    3    3
CO3 | L3           | 3              | 3   3   3   3   3   -   -   -   -   -    2    3    | 2    -    3    3
CO4 | L3           | 4              | 3   3   3   3   3   -   -   -   -   -    2    3    | 2    -    3    3
CO5 | L3           | 5              | 3   3   3   3   3   -   -   -   -   -    2    3    | 2    -    3    3


3. Write code to plot a line chart with the number of hours spent studying on the x-axis and the score in the final exam on the y-axis. Use a red '*' as the point character, label the axes, and give the plot a title.

Number of hours spent studying (x): 10, 9, 2, 15, 10, 16, 11, 16
Score in the final exam (0–100) (y): 95, 80, 10, 50, 45, 98, 38, 93

import matplotlib.pyplot as plt

# Provided data
hours_spent_studying = [10, 9, 2, 15, 10, 16, 11, 16]
scores_in_final_exam = [95, 80, 10, 50, 45, 98, 38, 93]

# Plotting the data
plt.plot(hours_spent_studying, scores_in_final_exam, marker='*', color='red', linestyle='-')

# Adding labels and title
plt.xlabel('Number of Hours Spent Studying')
plt.ylabel('Score in Final Exam')
plt.title('Effect of Study Hours on Exam Performance')

# Displaying the plot
plt.grid(True)
plt.show()
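
Note that the tabulated data is not sorted by hours, so the connecting line will zig-zag back and forth. If a left-to-right line is preferred, a small optional preprocessing step (a sketch, not part of the prescribed output) sorts the (x, y) pairs first:

# Optional: sort the pairs by hours so the line reads left to right
pairs = sorted(zip(hours_spent_studying, scores_in_final_exam))
xs, ys = zip(*pairs)
plt.plot(xs, ys, marker='*', color='red', linestyle='-')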


4. For the given dataset mtcars.csv (www.kaggle.com/ruiromanini/mtcars), plot a histogram to check the frequency distribution of the variable 'mpg' (miles per gallon).
import pandas as pd
import matplotlib.pyplot as plt
# Load the dataset
mtcars = pd.read_csv('mtcars.csv')
# Plotting histogram
plt.hist(mtcars['mpg'], bins=10, color='skyblue', edgecolor='black')
# Adding labels and title
plt.xlabel('Miles per Gallon (mpg)')
plt.ylabel('Frequency')
plt.title('Histogram of Miles per Gallon (mpg)')
# Displaying the plot
plt.grid(True)
plt.show()
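
The same histogram can also be drawn through pandas' plotting wrapper around matplotlib; a minimal equivalent sketch, assuming the mtcars DataFrame and the plt import from the listing above:

# pandas' .plot(kind='hist') is a thin wrapper over matplotlib's hist
mtcars['mpg'].plot(kind='hist', bins=10, edgecolor='black', title='Histogram of Miles per Gallon (mpg)')
plt.xlabel('Miles per Gallon (mpg)')
plt.show()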


5. Consider the books dataset BL-Flickr-Images-Book.csv from Kaggle (https://www.kaggle.com/adeyoyintemidayo/publication-of-books), which contains information about books. Write a program to demonstrate the following:
• Import the data into a DataFrame
• Find and drop the columns which are irrelevant for the book information
• Change the index of the DataFrame
• Tidy up fields in the data, such as date of publication, with the help of a simple regular expression
• Combine str methods with NumPy to clean columns
import pandas as pd
import numpy as np

# Import the data into a DataFrame
books_df = pd.read_csv('BL-Flickr-Images-Book.csv')

# Display the first few rows of the DataFrame
print("Original DataFrame:")
print(books_df.head())

# Find and drop the columns which are irrelevant for the book information
irrelevant_columns = ['Edition Statement', 'Corporate Author', 'Corporate Contributors',
                      'Former owner', 'Engraver', 'Contributors', 'Issuance type', 'Shelfmarks']
books_df.drop(columns=irrelevant_columns, inplace=True)

# Change the index of the DataFrame
books_df.set_index('Identifier', inplace=True)

# Tidy up the date of publication with a simple regular expression:
# keep the four-digit year at the start of the field
books_df['Date of Publication'] = books_df['Date of Publication'].str.extract(r'^(\d{4})', expand=False)

# Convert the extracted year to numeric; unparseable entries become NaN
books_df['Date of Publication'] = pd.to_numeric(books_df['Date of Publication'], errors='coerce')

# Display the cleaned DataFrame
print("\nCleaned DataFrame:")
print(books_df.head())
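
The final bullet, combining str methods with NumPy, is commonly demonstrated on this dataset's 'Place of Publication' column (assumed present, as in the Kaggle file); a minimal sketch using np.where to normalize messy city names:

# Combine str methods with NumPy: collapse variant spellings with np.where
pub = books_df['Place of Publication']
books_df['Place of Publication'] = np.where(
    pub.str.contains('London', na=False), 'London',
    np.where(pub.str.contains('Oxford', na=False), 'Oxford',
             pub.str.replace('-', ' ', regex=False)))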

6. Train a regularized logistic regression classifier on the iris dataset (https://archive.ics.uci.edu/ml/machine-learning-databases/iris/ or the inbuilt iris dataset) using sklearn. Train the model with the hyperparameter C = 1e4 and report the best classification accuracy.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

# Load the iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train the logistic regression classifier with C = 1e4
# (a large C means weak regularization)
C = 1e4
clf = LogisticRegression(C=C, solver='lbfgs', max_iter=1000)
clf.fit(X_train_scaled, y_train)

# Predict on the testing set
y_pred = clf.predict(X_test_scaled)

# Calculate the classification accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Classification accuracy with C = 1e4:", accuracy)

7. Train an SVM classifier on the iris dataset using sklearn. Try different kernels and the associated hyperparameters. Train the model with the following set of hyperparameters: RBF kernel, gamma = 0.5, one-vs-rest classifier, no feature normalization. Also try C = 0.01, 1, 10. For the above set of hyperparameters, find the best classification accuracy along with the total number of support vectors on the test data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets (no feature normalization, per the exercise)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Hyperparameters
kernels = ['rbf']
gammas = [0.5]
Cs = [0.01, 1, 10]

best_accuracy = 0
best_support_vectors = None

# Train SVM classifiers with different hyperparameters
for kernel in kernels:
    for gamma in gammas:
        for C in Cs:
            clf = SVC(kernel=kernel, gamma=gamma, C=C, decision_function_shape='ovr')
            clf.fit(X_train, y_train)
            y_pred = clf.predict(X_test)
            accuracy = accuracy_score(y_test, y_pred)
            # n_support_ counts the support vectors chosen during training
            support_vectors = clf.n_support_.sum()
            print(f"Kernel: {kernel}, Gamma: {gamma}, C: {C}, "
                  f"Accuracy: {accuracy}, Support Vectors: {support_vectors}")
            if accuracy > best_accuracy:
                best_accuracy = accuracy
                best_support_vectors = support_vectors

print("\nBest classification accuracy:", best_accuracy)
print("Total number of support vectors for the best-accuracy model:", best_support_vectors)

8. Consider the following dataset. Write a program to demonstrate the working of the decision-tree-based ID3 algorithm.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Define the dataset
data = {
    'Price': ['Low', 'Low', 'Low', 'Low', 'Low', 'Med', 'Med', 'Med', 'Med', 'High', 'High', 'High', 'High'],
    'Maintenance': ['Low', 'Med', 'Low', 'Med', 'High', 'Med', 'Med', 'High', 'High', 'Med', 'Med', 'High', 'High'],
    'Capacity': [2, 4, 4, 4, 4, 4, 4, 2, 5, 4, 2, 2, 5],
    'Airbag': ['No', 'Yes', 'No', 'No', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'Yes', 'No', 'Yes'],
    'Profitable': ['Yes', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes']
}

# Create DataFrame
df = pd.DataFrame(data)

# Convert categorical variables to numerical form
le = LabelEncoder()
df['Price'] = le.fit_transform(df['Price'])
df['Maintenance'] = le.fit_transform(df['Maintenance'])
df['Airbag'] = le.fit_transform(df['Airbag'])
df['Profitable'] = le.fit_transform(df['Profitable'])

# Split dataset into features and target variable
X = df.drop('Profitable', axis=1)
y = df['Profitable']

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Decision Tree classifier object; criterion="entropy" gives
# information-gain splits, mirroring ID3
clf = DecisionTreeClassifier(criterion="entropy")

# Train the Decision Tree classifier
clf = clf.fit(X_train, y_train)

# Predict the response for the test dataset
y_pred = clf.predict(X_test)

# Model accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))

9. Consider the dataset spiral.txt (https://bit.ly/2Lm75Ly). The first two columns in the dataset correspond to the coordinates of each data point. The third column corresponds to the actual cluster label. Compute the Rand index for the following methods:
• K-means clustering
• Single-link hierarchical clustering
• Complete-link hierarchical clustering
• Also visualize the dataset and determine which algorithm is able to recover the true clusters.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

# Load the dataset
data = np.loadtxt('spiral.txt')

# Extract features (coordinates) and true labels
X = data[:, :2]
true_labels = data[:, 2]

# Use as many clusters as there are distinct true labels
n_clusters = len(np.unique(true_labels))

# Perform K-means clustering
kmeans = KMeans(n_clusters=n_clusters, random_state=42)
kmeans_labels = kmeans.fit_predict(X)

# Perform single-link hierarchical clustering
single_link_labels = AgglomerativeClustering(n_clusters=n_clusters, linkage='single').fit_predict(X)

# Perform complete-link hierarchical clustering
complete_link_labels = AgglomerativeClustering(n_clusters=n_clusters, linkage='complete').fit_predict(X)

# Compute the adjusted Rand index (ARI) for each method
rand_index_kmeans = adjusted_rand_score(true_labels, kmeans_labels)
rand_index_single_link = adjusted_rand_score(true_labels, single_link_labels)
rand_index_complete_link = adjusted_rand_score(true_labels, complete_link_labels)

# Visualize the dataset and clustering results
plt.figure(figsize=(12, 8))

plt.subplot(2, 2, 1)
plt.scatter(X[:, 0], X[:, 1], c=true_labels, cmap='viridis', s=10)
plt.title('True Clusters')

plt.subplot(2, 2, 2)
plt.scatter(X[:, 0], X[:, 1], c=kmeans_labels, cmap='viridis', s=10)
plt.title(f'K-means Clustering (ARI={rand_index_kmeans:.2f})')

plt.subplot(2, 2, 3)
plt.scatter(X[:, 0], X[:, 1], c=single_link_labels, cmap='viridis', s=10)
plt.title(f'Single-link Hierarchical Clustering (ARI={rand_index_single_link:.2f})')

plt.subplot(2, 2, 4)
plt.scatter(X[:, 0], X[:, 1], c=complete_link_labels, cmap='viridis', s=10)
plt.title(f'Complete-link Hierarchical Clustering (ARI={rand_index_complete_link:.2f})')

plt.tight_layout()
plt.show()

print(f"Adjusted Rand Index for K-means: {rand_index_kmeans:.2f}")
print(f"Adjusted Rand Index for Single-link Hierarchical Clustering: {rand_index_single_link:.2f}")
print(f"Adjusted Rand Index for Complete-link Hierarchical Clustering: {rand_index_complete_link:.2f}")

10. Mini Project – Simple web scraping on social media

First install the required libraries:

pip install requests beautifulsoup4

To scrape Twitter data:
import requests
from bs4 import BeautifulSoup

# Function to scrape Twitter data
def scrape_twitter_hashtag(hashtag, count=10):
    url = f'https://twitter.com/hashtag/{hashtag}?src=hashtag_click'
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        # These selectors target Twitter's legacy HTML layout
        tweets = soup.find_all('div', class_='tweet')
        scraped_tweets = []
        for tweet in tweets[:count]:
            username = tweet.find('span', class_='username').text.strip()
            timestamp = tweet.find('a', class_='tweet-timestamp')['title']
            text = tweet.find('p', class_='tweet-text').text.strip()
            scraped_tweets.append({'username': username, 'timestamp': timestamp, 'text': text})
        return scraped_tweets
    else:
        print(f'Error: Failed to fetch tweets (status code {response.status_code})')
        return []

# Main function
def main():
    hashtag = 'python'  # Change to your desired hashtag
    count = 10          # Number of tweets to scrape
    tweets = scrape_twitter_hashtag(hashtag, count)
    for idx, tweet in enumerate(tweets, start=1):
        print(f'Tweet {idx}:')
        print(f'Username: {tweet["username"]}')
        print(f'Timestamp: {tweet["timestamp"]}')
        print(f'Text: {tweet["text"]}')
        print('-' * 50)

if __name__ == '__main__':
    main()
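
A practical caveat: twitter.com now renders tweets with JavaScript and restricts anonymous scraping, so the selectors above (div.tweet, span.username, and so on) match the legacy page layout and may find nothing on the live site. Twitter's official API is the supported route for reliable access.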

To scrape Instagram data:
import requests
import re
import json
from bs4 import BeautifulSoup

# Function to scrape Instagram data
def scrape_instagram_hashtag(hashtag, count=10):
    url = f'https://www.instagram.com/explore/tags/{hashtag}/'
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        # Instagram's legacy pages embedded post data in a window._sharedData
        # JSON blob; recent versions may not expose it to anonymous requests
        script_tag = soup.find('script', text=re.compile(r'window\._sharedData'))
        json_data = json.loads(script_tag.string.split(' = ', 1)[1].rstrip(';'))
        posts = json_data['entry_data']['TagPage'][0]['graphql']['hashtag']['edge_hashtag_to_media']['edges']
        scraped_posts = []
        for post in posts[:count]:
            node = post['node']
            username = node['owner']['username']
            caption = node['edge_media_to_caption']['edges'][0]['node']['text']
            image_url = node['display_url']
            scraped_posts.append({'username': username, 'caption': caption, 'image_url': image_url})
        return scraped_posts
    else:
        print(f'Error: Failed to fetch Instagram posts (status code {response.status_code})')
        return []

# Main function
def main():
    hashtag = 'nature'  # Change to your desired hashtag
    count = 10          # Number of posts to scrape
    posts = scrape_instagram_hashtag(hashtag, count)
    for idx, post in enumerate(posts, start=1):
        print(f'Post {idx}:')
        print(f'Username: {post["username"]}')
        print(f'Caption: {post["caption"]}')
        print(f'Image URL: {post["image_url"]}')
        print('-' * 50)

if __name__ == '__main__':
    main()
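
As with the Twitter example, this is a best-effort sketch against Instagram's older page structure. For the mini project, any scraping should respect each platform's terms of service and robots.txt, and the official APIs (or published datasets, for example from Kaggle) are the more dependable source of social media data.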
