Aids - 21ad62 - Datascience Lab Manual-1
Affiliated to VTU, Approved by AICTE, Accredited by NBA and NAAC with “A++” Grade
ITPL MAIN ROAD, BROOKFIELD, BENGALURU-560037, KARNATAKA, INDIA
LAB MANUAL
DATA SCIENCE AND APPLICATIONS LABORATORY
(Effective from the academic year 2023-2024)
TABLE OF CONTENTS
S. No Programs Page
1. Demonstrate the installation of Python/R and the Visual Studio Code editor,
along with the use of Kaggle datasets.
2. Write programs in Python/R and execute them in Visual Studio Code,
PyCharm Community Edition, or any other suitable environment.
3. A study was conducted to understand the effect of the number of hours students
spent studying on their performance in the final exams. Write code to plot a line
chart with the number of hours spent studying on the x-axis and the score in the final
exam on the y-axis. Use a red ‘*’ as the point character, label the axes, and give the plot a title.
7. Train an SVM classifier on the iris dataset using sklearn. Try different kernels
and the associated hyperparameters. Train the model with the following set of
hyperparameters: RBF kernel, gamma=0.5, one-vs-rest classifier, no feature
normalization. Also try C = 0.01, 1, 10. For the above set of
hyperparameters, find the best classification accuracy along with the total number of
support vectors on the test data.
Lab Manual – 21AD62-DSA LAB
3. Write code to plot a line chart with the number of hours spent studying on the x-axis and the score in the
final exam on the y-axis. Use a red ‘*’ as the point character, label the axes, and give the plot a title.
# Provided data
hours_spent_studying = [10, 9, 2, 15, 10, 16, 11, 16]
scores_in_final_exam = [95, 80, 10, 50, 45, 98, 38, 93]
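A minimal sketch of the plot for the data above, using matplotlib. The axis labels and title text are illustrative choices, not prescribed by the exercise. The pairs are sorted by hours before plotting so that the line runs left to right; this is a presentation choice, and plotting the lists in their given order is equally valid.

```python
import matplotlib.pyplot as plt

# Provided data
hours_spent_studying = [10, 9, 2, 15, 10, 16, 11, 16]
scores_in_final_exam = [95, 80, 10, 50, 45, 98, 38, 93]

# Sort the (hours, score) pairs by hours so the line does not double back.
pairs = sorted(zip(hours_spent_studying, scores_in_final_exam))
hours_sorted = [h for h, _ in pairs]
scores_sorted = [s for _, s in pairs]

# Line chart with a red '*' as the point character
fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(hours_sorted, scores_sorted, color='red', marker='*', linestyle='-')

# Label the axes and give the plot a title (wording is an assumption)
ax.set_xlabel('Hours spent studying')
ax.set_ylabel('Score in final exam')
ax.set_title('Effect of Study Hours on Final Exam Score')
ax.grid(True)

plt.show()
```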
import pandas as pd
import matplotlib.pyplot as plt
# Load the dataset
mtcars = pd.read_csv('mtcars.csv')
# Plotting histogram
plt.hist(mtcars['mpg'], bins=10, color='skyblue', edgecolor='black')
# Adding labels and title
plt.xlabel('Miles per Gallon (mpg)')
plt.ylabel('Frequency')
plt.title('Histogram of Miles per Gallon (mpg)')
# Displaying the plot
plt.grid(True)
plt.show()
import pandas as pd
import numpy as np
# Find and drop the columns which are irrelevant for the book information
irrelevant_columns = ['Edition Statement', 'Corporate Author', 'Corporate Contributors',
'Former owner', 'Engraver', 'Contributors', 'Issuance type', 'Shelfmarks']
books_df.drop(columns=irrelevant_columns, inplace=True)
# Tidy up fields in the data such as date of publication with the help of simple regular expression
books_df['Date of Publication'] = books_df['Date of Publication'].str.extract(r'^(\d{4})', expand=False)
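To see what the regular expression above does, the following sketch applies it to a few made-up date strings. The sample values are hypothetical and are not taken from the books dataset. The pattern `^(\d{4})` keeps only a four-digit year at the very start of the string, so an entry that begins with a bracket yields a missing value; the unanchored variant shown for comparison is one possible alternative, not the manual's method.

```python
import pandas as pd

# Hypothetical sample values, for illustration only
dates = pd.Series(['1879 [1878]', '[1897?]', '1868', '1839, 38-54'])

# Same pattern as above: a four-digit year anchored at the start of the string
extracted = dates.str.extract(r'^(\d{4})', expand=False)

# Unanchored variant: the first four-digit run anywhere in the string
relaxed = dates.str.extract(r'(\d{4})', expand=False)

# Convert the extracted strings to numbers; unmatched entries become NaN
years = pd.to_numeric(extracted)

print(pd.DataFrame({'raw': dates, 'anchored': extracted, 'unanchored': relaxed}))
```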
sklearn. Train the model with the hyperparameter C = 1e4 and report the best classification
accuracy.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
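The imports above suggest a logistic regression classifier on the iris dataset. The sketch below is one way the rest of the program might look under that assumption. The 80/20 split, `random_state=42`, and `max_iter=1000` are illustrative choices, not values given in the exercise.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

# Load the iris dataset
X, y = load_iris(return_X_y=True)

# Hold out 20% of the samples for testing (split parameters are assumptions)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Standardize features using statistics from the training set only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# C is the inverse regularization strength; C = 1e4 means very weak regularization
model = LogisticRegression(C=1e4, max_iter=1000)
model.fit(X_train_scaled, y_train)

# Report the classification accuracy on the held-out data
accuracy = accuracy_score(y_test, model.predict(X_test_scaled))
print("Accuracy:", accuracy)
```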
7. Train an SVM classifier on the iris dataset using sklearn. Try different kernels and the associated
hyperparameters. Train the model with the following set of hyperparameters: RBF kernel, gamma=0.5,
one-vs-rest classifier, no feature normalization. Also try C = 0.01, 1, 10. For the above set of
hyperparameters, find the best classification accuracy along with the total number of support vectors on
the test data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
# Hyperparameters
kernels = ['rbf']
gammas = [0.5]
Cs = [0.01, 1, 10]
best_accuracy = 0
best_support_vectors = None
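The setup above can be completed with a loop over the hyperparameter grid, sketched below. Several points are assumptions: the 80/20 split and `random_state=42` are illustrative; accuracy is measured on the test split, while the support-vector count comes from the fitted model (support vectors are training points); and `decision_function_shape='ovr'` only gives the decision function a one-vs-rest shape, since `SVC` trains one-vs-one internally. Wrapping `SVC` in `OneVsRestClassifier` would be an alternative reading of the exercise.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load data; no feature normalization, as the exercise requires
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Hyperparameters
kernels = ['rbf']
gammas = [0.5]
Cs = [0.01, 1, 10]

best_accuracy = 0
best_support_vectors = None
best_params = None
results = []

for kernel in kernels:
    for gamma in gammas:
        for C in Cs:
            model = SVC(kernel=kernel, gamma=gamma, C=C,
                        decision_function_shape='ovr')
            model.fit(X_train, y_train)

            # Accuracy on the held-out test data
            accuracy = accuracy_score(y_test, model.predict(X_test))
            # Total support vectors, summed over the three classes
            n_sv = int(model.n_support_.sum())
            results.append((kernel, gamma, C, accuracy, n_sv))
            print(f"kernel={kernel}, gamma={gamma}, C={C}: "
                  f"accuracy={accuracy:.3f}, support vectors={n_sv}")

            # Keep the first configuration that reaches the highest accuracy
            if accuracy > best_accuracy:
                best_accuracy = accuracy
                best_support_vectors = n_sv
                best_params = (kernel, gamma, C)

print("Best parameters:", best_params)
print("Best accuracy:", best_accuracy)
print("Support vectors for best model:", best_support_vectors)
```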
8. Consider the following dataset. Write a program to demonstrate the working of the
decision-tree-based ID3 algorithm.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score  # needed for the accuracy report below
# Create DataFrame
df = pd.DataFrame(data)
# Model Accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))
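ID3 chooses, at each node, the feature with the highest information gain, that is, the largest reduction in entropy of the class label. The sketch below computes that quantity by hand and then fits a scikit-learn tree with `criterion='entropy'`. The toy data is hypothetical and is not the dataset from the exercise. Note also that scikit-learn implements an optimized CART variant with binary splits, so the fitted tree only approximates ID3.

```python
import math
from collections import Counter

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data, NOT the dataset referred to in the exercise
data = {
    'outlook': ['sunny', 'sunny', 'overcast', 'rain', 'rain', 'overcast', 'sunny', 'rain'],
    'windy':   ['no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no'],
    'play':    ['no', 'no', 'yes', 'yes', 'no', 'yes', 'no', 'yes'],
}
df = pd.DataFrame(data)


def entropy(labels):
    """Shannon entropy (in bits) of a collection of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(frame, feature, target):
    """Entropy of the target minus its weighted entropy after splitting on feature."""
    weighted = sum(len(subset) / len(frame) * entropy(subset[target])
                   for _, subset in frame.groupby(feature))
    return entropy(frame[target]) - weighted


# ID3 selects the feature with the highest information gain as the root
gains = {f: information_gain(df, f, 'play') for f in ['outlook', 'windy']}
best_feature = max(gains, key=gains.get)
print("Information gain:", gains)
print("Root feature chosen by ID3:", best_feature)

# Entropy-based tree in scikit-learn (categorical features one-hot encoded)
X = pd.get_dummies(df[['outlook', 'windy']])
y = df['play']
clf = DecisionTreeClassifier(criterion='entropy', random_state=0).fit(X, y)
train_accuracy = clf.score(X, y)
print("Training accuracy:", train_accuracy)
```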
9. Consider the dataset spiral.txt (https://bit.ly/2Lm75Ly). The first two columns in the dataset
correspond to the coordinates of each data point. The third column corresponds to the actual cluster
label. Compute the Rand index for the following methods:
• K-means clustering
• Single-link hierarchical clustering
• Complete-link hierarchical clustering
• Also visualize the dataset and identify which algorithm is able to recover the true clusters.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score
from scipy.cluster.hierarchy import linkage, dendrogram
plt.subplot(2, 2, 1)
plt.scatter(X[:, 0], X[:, 1], c=true_labels, cmap='viridis', s=10)
plt.title('True Clusters')
plt.subplot(2, 2, 2)
plt.scatter(X[:, 0], X[:, 1], c=kmeans_labels, cmap='viridis', s=10)
plt.title(f'K-means Clustering (ARI={rand_index_kmeans:.2f})')
plt.subplot(2, 2, 3)
plt.scatter(X[:, 0], X[:, 1], c=single_link_labels, cmap='viridis', s=10)
plt.title(f'Single-link Hierarchical Clustering (ARI={rand_index_single_link:.2f})')
plt.subplot(2, 2, 4)
plt.scatter(X[:, 0], X[:, 1], c=complete_link_labels, cmap='viridis', s=10)
plt.title(f'Complete-link Hierarchical Clustering (ARI={rand_index_complete_link:.2f})')
plt.tight_layout()
plt.show()
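The plotting code above relies on label arrays and scores that are computed elsewhere. A minimal sketch of that step follows, written as a helper whose name, `compare_clusterings`, is hypothetical. Because spiral.txt is not bundled here, the demonstration uses a synthetic two-ring dataset from `make_circles` as a stand-in; the commented lines show how the real file might be loaded, assuming it is whitespace-delimited. The code above reports the adjusted Rand index, while the exercise asks for the Rand index, so the sketch computes both.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.datasets import make_circles
from sklearn.metrics import rand_score, adjusted_rand_score


def compare_clusterings(X, true_labels, n_clusters):
    """Fit the three methods; return predicted labels and (Rand, adjusted Rand) scores."""
    models = {
        'kmeans': KMeans(n_clusters=n_clusters, n_init=10, random_state=0),
        'single_link': AgglomerativeClustering(n_clusters=n_clusters, linkage='single'),
        'complete_link': AgglomerativeClustering(n_clusters=n_clusters, linkage='complete'),
    }
    labels, scores = {}, {}
    for name, model in models.items():
        labels[name] = model.fit_predict(X)
        scores[name] = (rand_score(true_labels, labels[name]),
                        adjusted_rand_score(true_labels, labels[name]))
    return labels, scores


# With the real file (assumed whitespace-delimited), loading might look like:
#   data = np.loadtxt('spiral.txt')
#   X, true_labels = data[:, :2], data[:, 2].astype(int)
# Stand-in data: two concentric rings, which are not linearly separable
X, true_labels = make_circles(n_samples=300, factor=0.4, noise=0.02, random_state=0)
n_clusters = len(np.unique(true_labels))

labels, scores = compare_clusterings(X, true_labels, n_clusters)
for name, (ri, ari) in scores.items():
    print(f"{name}: Rand index={ri:.3f}, adjusted Rand index={ari:.3f}")
```

On chain-like or ring-like shapes such as this stand-in, single-link clustering tends to follow the structure, whereas K-means and complete-link favor compact clusters; whether the same holds for spiral.txt should be checked by running the code on the actual file.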
pip install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup
import re
# Main function
def main():
    hashtag = 'python'  # Change to your desired hashtag
    count = 10  # Number of tweets to scrape
    tweets = scrape_twitter_hashtag(hashtag, count)
    for idx, tweet in enumerate(tweets, start=1):
        print(f'Tweet {idx}:')
        print(f'Username: {tweet["username"]}')
        print(f'Timestamp: {tweet["timestamp"]}')
        print(f'Text: {tweet["text"]}')
        print('-' * 50)

if __name__ == '__main__':
    main()
import requests
from bs4 import BeautifulSoup
import json
# Function to scrape Instagram data
def scrape_instagram_hashtag(hashtag, count=10):
    url = f'https://www.instagram.com/explore/tags/{hashtag}/'
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')