
CMR INSTITUTE OF TECHNOLOGY

Affiliated to VTU, Approved by AICTE, Accredited by NBA and NAAC with “A++” Grade
ITPL MAIN ROAD, BROOKFIELD, BENGALURU-560037, KARNATAKA, INDIA

Department of Artificial Intelligence and Data Science

LAB MANUAL
DATA SCIENCE AND APPLICATIONS LABORATORY
(Effective from the academic year 2023-2024)

Course Code: 21AD62



TABLE OF CONTENTS

1. Installation of Python/R language and the Visual Studio Code editor, demonstrated along with Kaggle dataset usage.
2. Write programs in Python/R and execute them in Visual Studio Code, PyCharm Community Edition, or any other suitable environment.
3. A study was conducted to understand the effect of the number of hours students spent studying on their performance in the final exams. Write code to plot a line chart with the number of hours spent studying on the x-axis and the score in the final exam on the y-axis. Use a red '*' as the point character, label the axes, and give the plot a title.
4. For the given dataset mtcars.csv (www.kaggle.com/ruiromanini/mtcars), plot a histogram to check the frequency distribution of the variable 'mpg' (miles per gallon).
5. Consider the books dataset BL-Flickr-Images-Book.csv from Kaggle (https://www.kaggle.com/adeyoyintemidayo/publication-of-books), which contains information about books. Write a program to demonstrate the following:
   • Import the data into a DataFrame
   • Find and drop the columns which are irrelevant for the book information
   • Change the index of the DataFrame
   • Tidy up fields in the data, such as date of publication, with the help of a simple regular expression
   • Combine str methods with NumPy to clean columns
6. Train a regularized logistic regression classifier on the iris dataset (https://archive.ics.uci.edu/ml/machine-learning-databases/iris/ or the inbuilt iris dataset) using sklearn. Train the model with the hyperparameter C = 1e4 and report the best classification accuracy.
7. Train an SVM classifier on the iris dataset using sklearn. Try different kernels and the associated hyperparameters. Train the model with the following set of hyperparameters: RBF kernel, gamma = 0.5, one-vs-rest classifier, no feature normalization. Also try C = 0.01, 1, 10. For the above set of hyperparameters, find the best classification accuracy along with the total number of support vectors on the test data.
8. Consider the following dataset. Write a program to demonstrate the working of the decision-tree-based ID3 algorithm.
9. Consider the dataset spiral.txt (https://bit.ly/2Lm75Ly). The first two columns in the dataset correspond to the coordinates of each data point. The third column corresponds to the actual cluster label. Compute the Rand index for the following methods:
   • K-means clustering
   • Single-link hierarchical clustering
   • Complete-link hierarchical clustering
   • Also visualize the dataset and determine which algorithm is able to recover the true clusters.
10. Mini Project – Simple web scraping on social media

Course outcomes (Course Skill Set):

At the end of the course, the student will be able to:

CO 1. Identify and demonstrate data using visualization tools.
CO 2. Make use of statistical hypothesis tests to choose the properties of data, and curate and manipulate data.
CO 3. Utilize the skills of machine learning algorithms and techniques and develop models.
CO 4. Demonstrate the construction of decision trees and data partitioning using clustering.
CO 5. Experiment with social network analysis and make use of natural language processing skills to develop data-driven applications.

CO-PO and CO-PSO Mapping

All five course outcomes (stated above) are taught at Blooms level L3, each covering the correspondingly numbered module. The mapping strengths are:

CO  | Blooms Level | Module covered | PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12 | PSO1 PSO2 PSO3 PSO4
CO1 | L3           | 1              | 3   3   3   3   3   -   -   -   -   -    2    3    | 2    -    3    3
CO2 | L3           | 2              | 3   3   3   3   3   -   -   -   -   -    2    3    | 2    -    3    3
CO3 | L3           | 3              | 3   3   3   3   3   -   -   -   -   -    2    3    | 2    -    3    3
CO4 | L3           | 4              | 3   3   3   3   3   -   -   -   -   -    2    3    | 2    -    3    3
CO5 | L3           | 5              | 3   3   3   3   3   -   -   -   -   -    2    3    | 2    -    3    3


3. Write code to plot a line chart with the number of hours spent studying on the x-axis and the score in the final exam on the y-axis. Use a red '*' as the point character, label the axes, and give the plot a title.

Number of hours spent studying (x): 10, 9, 2, 15, 10, 16, 11, 16
Score in the final exam (0–100) (y): 95, 80, 10, 50, 45, 98, 38, 93

import matplotlib.pyplot as plt

# Provided data
hours_spent_studying = [10, 9, 2, 15, 10, 16, 11, 16]
scores_in_final_exam = [95, 80, 10, 50, 45, 98, 38, 93]

# Plotting the data
plt.plot(hours_spent_studying, scores_in_final_exam, marker='*', color='red', linestyle='-')

# Adding labels and title
plt.xlabel('Number of Hours Spent Studying')
plt.ylabel('Score in Final Exam')
plt.title('Effect of Study Hours on Exam Performance')

# Displaying the plot
plt.grid(True)
plt.show()
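
Note that the tabulated data is not sorted by hours, so the connecting line will zig-zag back and forth. If a left-to-right line is preferred, a small optional preprocessing step (a sketch, not part of the prescribed output) sorts the (x, y) pairs first:

# Optional: sort the pairs by hours so the line reads left to right
pairs = sorted(zip(hours_spent_studying, scores_in_final_exam))
xs, ys = zip(*pairs)
plt.plot(xs, ys, marker='*', color='red', linestyle='-')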


4. For the given dataset mtcars.csv (www.kaggle.com/ruiromanini/mtcars), plot a histogram to check the frequency distribution of the variable 'mpg' (miles per gallon).
import pandas as pd
import matplotlib.pyplot as plt
# Load the dataset
mtcars = pd.read_csv('mtcars.csv')
# Plotting histogram
plt.hist(mtcars['mpg'], bins=10, color='skyblue', edgecolor='black')
# Adding labels and title
plt.xlabel('Miles per Gallon (mpg)')
plt.ylabel('Frequency')
plt.title('Histogram of Miles per Gallon (mpg)')
# Displaying the plot
plt.grid(True)
plt.show()
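
The same histogram can also be drawn through pandas' plotting wrapper around matplotlib; a minimal equivalent sketch, assuming the mtcars DataFrame and the plt import from the listing above:

# pandas' .plot(kind='hist') is a thin wrapper over matplotlib's hist
mtcars['mpg'].plot(kind='hist', bins=10, edgecolor='black', title='Histogram of Miles per Gallon (mpg)')
plt.xlabel('Miles per Gallon (mpg)')
plt.show()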


5. Consider the books dataset BL-Flickr-Images-Book.csv from Kaggle (https://www.kaggle.com/adeyoyintemidayo/publication-of-books), which contains information about books. Write a program to demonstrate the following:
• Import the data into a DataFrame
• Find and drop the columns which are irrelevant for the book information
• Change the index of the DataFrame
• Tidy up fields in the data, such as date of publication, with the help of a simple regular expression
• Combine str methods with NumPy to clean columns
import pandas as pd
import numpy as np

# Import the data into a DataFrame
books_df = pd.read_csv('BL-Flickr-Images-Book.csv')

# Display the first few rows of the DataFrame
print("Original DataFrame:")
print(books_df.head())

# Find and drop the columns which are irrelevant for the book information
irrelevant_columns = ['Edition Statement', 'Corporate Author', 'Corporate Contributors',
                      'Former owner', 'Engraver', 'Contributors', 'Issuance type', 'Shelfmarks']
books_df.drop(columns=irrelevant_columns, inplace=True)

# Change the index of the DataFrame
books_df.set_index('Identifier', inplace=True)

# Tidy up the date of publication with a simple regular expression:
# keep the four-digit year at the start of the field
books_df['Date of Publication'] = books_df['Date of Publication'].str.extract(r'^(\d{4})', expand=False)

# Convert the extracted year to numeric; unparseable entries become NaN
books_df['Date of Publication'] = pd.to_numeric(books_df['Date of Publication'], errors='coerce')

# Display the cleaned DataFrame
print("\nCleaned DataFrame:")
print(books_df.head())
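
The final bullet, combining str methods with NumPy, is commonly demonstrated on this dataset's 'Place of Publication' column (assumed present, as in the Kaggle file); a minimal sketch using np.where to normalize messy city names:

# Combine str methods with NumPy: collapse variant spellings with np.where
pub = books_df['Place of Publication']
books_df['Place of Publication'] = np.where(
    pub.str.contains('London', na=False), 'London',
    np.where(pub.str.contains('Oxford', na=False), 'Oxford',
             pub.str.replace('-', ' ', regex=False)))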

6. Train a regularized logistic regression classifier on the iris dataset (https://archive.ics.uci.edu/ml/machine-learning-databases/iris/ or the inbuilt iris dataset) using sklearn. Train the model with the hyperparameter C = 1e4 and report the best classification accuracy.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

# Load the iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train the logistic regression classifier with C = 1e4
# (a large C means weak regularization)
C = 1e4
clf = LogisticRegression(C=C, solver='lbfgs', max_iter=1000)
clf.fit(X_train_scaled, y_train)

# Predict on the testing set
y_pred = clf.predict(X_test_scaled)

# Calculate the classification accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Classification accuracy with C = 1e4:", accuracy)

7. Train an SVM classifier on the iris dataset using sklearn. Try different kernels and the associated hyperparameters. Train the model with the following set of hyperparameters: RBF kernel, gamma = 0.5, one-vs-rest classifier, no feature normalization. Also try C = 0.01, 1, 10. For the above set of hyperparameters, find the best classification accuracy along with the total number of support vectors on the test data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets (no feature normalization, per the exercise)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Hyperparameters
kernels = ['rbf']
gammas = [0.5]
Cs = [0.01, 1, 10]

best_accuracy = 0
best_support_vectors = None

# Train SVM classifiers with different hyperparameters
for kernel in kernels:
    for gamma in gammas:
        for C in Cs:
            clf = SVC(kernel=kernel, gamma=gamma, C=C, decision_function_shape='ovr')
            clf.fit(X_train, y_train)
            y_pred = clf.predict(X_test)
            accuracy = accuracy_score(y_test, y_pred)
            # n_support_ counts the support vectors chosen during training
            support_vectors = clf.n_support_.sum()
            print(f"Kernel: {kernel}, Gamma: {gamma}, C: {C}, "
                  f"Accuracy: {accuracy}, Support Vectors: {support_vectors}")
            if accuracy > best_accuracy:
                best_accuracy = accuracy
                best_support_vectors = support_vectors

print("\nBest classification accuracy:", best_accuracy)
print("Total number of support vectors for the best-accuracy model:", best_support_vectors)

8. Consider the following dataset. Write a program to demonstrate the working of the decision-tree-based ID3 algorithm.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Define the dataset
data = {
    'Price': ['Low', 'Low', 'Low', 'Low', 'Low', 'Med', 'Med', 'Med', 'Med', 'High', 'High', 'High', 'High'],
    'Maintenance': ['Low', 'Med', 'Low', 'Med', 'High', 'Med', 'Med', 'High', 'High', 'Med', 'Med', 'High', 'High'],
    'Capacity': [2, 4, 4, 4, 4, 4, 4, 2, 5, 4, 2, 2, 5],
    'Airbag': ['No', 'Yes', 'No', 'No', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'Yes', 'No', 'Yes'],
    'Profitable': ['Yes', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes']
}

# Create DataFrame
df = pd.DataFrame(data)

# Convert categorical variables to numerical form
le = LabelEncoder()
df['Price'] = le.fit_transform(df['Price'])
df['Maintenance'] = le.fit_transform(df['Maintenance'])
df['Airbag'] = le.fit_transform(df['Airbag'])
df['Profitable'] = le.fit_transform(df['Profitable'])

# Split dataset into features and target variable
X = df.drop('Profitable', axis=1)
y = df['Profitable']

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Decision Tree classifier object; criterion="entropy" gives
# information-gain splits, mirroring ID3
clf = DecisionTreeClassifier(criterion="entropy")

# Train the Decision Tree classifier
clf = clf.fit(X_train, y_train)

# Predict the response for the test dataset
y_pred = clf.predict(X_test)

# Model accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))

9. Consider the dataset spiral.txt (https://bit.ly/2Lm75Ly). The first two columns in the dataset correspond to the coordinates of each data point. The third column corresponds to the actual cluster label. Compute the Rand index for the following methods:
• K-means clustering
• Single-link hierarchical clustering
• Complete-link hierarchical clustering
• Also visualize the dataset and determine which algorithm is able to recover the true clusters.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

# Load the dataset
data = np.loadtxt('spiral.txt')

# Extract features (coordinates) and true labels
X = data[:, :2]
true_labels = data[:, 2]

# Use as many clusters as there are distinct true labels
n_clusters = len(np.unique(true_labels))

# Perform K-means clustering
kmeans = KMeans(n_clusters=n_clusters, random_state=42)
kmeans_labels = kmeans.fit_predict(X)

# Perform single-link hierarchical clustering
single_link_labels = AgglomerativeClustering(n_clusters=n_clusters, linkage='single').fit_predict(X)

# Perform complete-link hierarchical clustering
complete_link_labels = AgglomerativeClustering(n_clusters=n_clusters, linkage='complete').fit_predict(X)

# Compute the adjusted Rand index (ARI) for each method
rand_index_kmeans = adjusted_rand_score(true_labels, kmeans_labels)
rand_index_single_link = adjusted_rand_score(true_labels, single_link_labels)
rand_index_complete_link = adjusted_rand_score(true_labels, complete_link_labels)

# Visualize the dataset and clustering results
plt.figure(figsize=(12, 8))

plt.subplot(2, 2, 1)
plt.scatter(X[:, 0], X[:, 1], c=true_labels, cmap='viridis', s=10)
plt.title('True Clusters')

plt.subplot(2, 2, 2)
plt.scatter(X[:, 0], X[:, 1], c=kmeans_labels, cmap='viridis', s=10)
plt.title(f'K-means Clustering (ARI={rand_index_kmeans:.2f})')

plt.subplot(2, 2, 3)
plt.scatter(X[:, 0], X[:, 1], c=single_link_labels, cmap='viridis', s=10)
plt.title(f'Single-link Hierarchical Clustering (ARI={rand_index_single_link:.2f})')

plt.subplot(2, 2, 4)
plt.scatter(X[:, 0], X[:, 1], c=complete_link_labels, cmap='viridis', s=10)
plt.title(f'Complete-link Hierarchical Clustering (ARI={rand_index_complete_link:.2f})')

plt.tight_layout()
plt.show()

print(f"Adjusted Rand Index for K-means: {rand_index_kmeans:.2f}")
print(f"Adjusted Rand Index for Single-link Hierarchical Clustering: {rand_index_single_link:.2f}")
print(f"Adjusted Rand Index for Complete-link Hierarchical Clustering: {rand_index_complete_link:.2f}")

10. Mini Project – Simple web scraping on social media

First install the required libraries:

pip install requests beautifulsoup4

To scrape Twitter data:
import requests
from bs4 import BeautifulSoup

# Function to scrape Twitter data
def scrape_twitter_hashtag(hashtag, count=10):
    url = f'https://twitter.com/hashtag/{hashtag}?src=hashtag_click'
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        # These selectors target Twitter's legacy HTML layout
        tweets = soup.find_all('div', class_='tweet')
        scraped_tweets = []
        for tweet in tweets[:count]:
            username = tweet.find('span', class_='username').text.strip()
            timestamp = tweet.find('a', class_='tweet-timestamp')['title']
            text = tweet.find('p', class_='tweet-text').text.strip()
            scraped_tweets.append({'username': username, 'timestamp': timestamp, 'text': text})
        return scraped_tweets
    else:
        print(f'Error: Failed to fetch tweets (status code {response.status_code})')
        return []

# Main function
def main():
    hashtag = 'python'  # Change to your desired hashtag
    count = 10          # Number of tweets to scrape
    tweets = scrape_twitter_hashtag(hashtag, count)
    for idx, tweet in enumerate(tweets, start=1):
        print(f'Tweet {idx}:')
        print(f'Username: {tweet["username"]}')
        print(f'Timestamp: {tweet["timestamp"]}')
        print(f'Text: {tweet["text"]}')
        print('-' * 50)

if __name__ == '__main__':
    main()
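
A practical caveat: twitter.com now renders tweets with JavaScript and restricts anonymous scraping, so the selectors above (div.tweet, span.username, and so on) match the legacy page layout and may find nothing on the live site. Twitter's official API is the supported route for reliable access.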

To scrape Instagram data:
import requests
import re
import json
from bs4 import BeautifulSoup

# Function to scrape Instagram data
def scrape_instagram_hashtag(hashtag, count=10):
    url = f'https://www.instagram.com/explore/tags/{hashtag}/'
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        # Instagram's legacy pages embedded post data in a window._sharedData
        # JSON blob; recent versions may not expose it to anonymous requests
        script_tag = soup.find('script', text=re.compile(r'window\._sharedData'))
        json_data = json.loads(script_tag.string.split(' = ', 1)[1].rstrip(';'))
        posts = json_data['entry_data']['TagPage'][0]['graphql']['hashtag']['edge_hashtag_to_media']['edges']
        scraped_posts = []
        for post in posts[:count]:
            node = post['node']
            username = node['owner']['username']
            caption = node['edge_media_to_caption']['edges'][0]['node']['text']
            image_url = node['display_url']
            scraped_posts.append({'username': username, 'caption': caption, 'image_url': image_url})
        return scraped_posts
    else:
        print(f'Error: Failed to fetch Instagram posts (status code {response.status_code})')
        return []

# Main function
def main():
    hashtag = 'nature'  # Change to your desired hashtag
    count = 10          # Number of posts to scrape
    posts = scrape_instagram_hashtag(hashtag, count)
    for idx, post in enumerate(posts, start=1):
        print(f'Post {idx}:')
        print(f'Username: {post["username"]}')
        print(f'Caption: {post["caption"]}')
        print(f'Image URL: {post["image_url"]}')
        print('-' * 50)

if __name__ == '__main__':
    main()
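
As with the Twitter example, this is a best-effort sketch against Instagram's older page structure. For the mini project, any scraping should respect each platform's terms of service and robots.txt, and the official APIs (or published datasets, for example from Kaggle) are the more dependable source of social media data.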
