Data analytics assignment solutions

The document outlines various assignments involving data analysis and machine learning tasks using Python. Key tasks include building logistic and linear regression models, preprocessing datasets, applying the Apriori algorithm for association rule mining, and performing sentiment analysis. Additionally, it covers text preprocessing and summarization techniques, including the removal of stopwords and the generation of word clouds.

Assignment 1 Set A Q.

Create a 'User' dataset having 5 columns, namely: User ID, Gender,
Age, Estimated Salary and Purchased. Build a logistic regression
model that predicts, from the given parameters, whether a person
will buy a car or not.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Create the User dataset


data = {'User ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'Gender': ['M', 'F', 'F', 'M', 'F', 'M', 'F', 'M', 'F', 'M'],
'Age': [28, 45, 23, 31, 37, 22, 33, 42, 29, 25],
'Estimated Salary': [35000, 55000, 18000, 65000, 75000, 20000,
84000, 92000, 32000, 58000],
'Purchased': [0, 1, 0, 1, 1, 0, 1, 1, 0, 1]}
df = pd.DataFrame(data)

# Convert the Gender column to numeric format (M = 1, F = 0)
df['Gender'] = df['Gender'].map({'M': 1, 'F': 0})

# Split the dataset into training and testing data


X_train, X_test, y_train, y_test = train_test_split(df[['Gender', 'Age',
'Estimated Salary']],
df['Purchased'],
test_size=0.3,
random_state=0)

# Build the logistic regression model


lr = LogisticRegression()
lr.fit(X_train, y_train)

# Make predictions on the testing data


y_pred = lr.predict(X_test)
# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)
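
Once the model is trained, it can also be applied to unseen users. The snippet below is a small illustrative check on a hypothetical 40-year-old male earning 70,000 (values chosen only for demonstration):

# Predict for a hypothetical new user: male, 40 years old, salary 70000
new_user = pd.DataFrame({'Gender': [1], 'Age': [40], 'Estimated Salary': [70000]})
print('Predicted class:', lr.predict(new_user)[0])
print('Purchase probability:', lr.predict_proba(new_user)[0][1])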

Assignment 1 Set B Q.1

Build a simple linear regression model for Fish Species Weight Prediction.

# Import necessary libraries


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Load the dataset


df = pd.read_csv('Fish.csv')

# Preprocess the data
df = df.drop(['Species'], axis=1)  # Drop the Species column as it is categorical
df = df.dropna()  # Drop any rows with missing values

# Split the dataset into training and testing sets


X = df.drop(['Weight'], axis=1) # Features
y = df['Weight'] # Target variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=0)

# Train the model


regressor = LinearRegression()
regressor.fit(X_train, y_train)
# Make predictions on the test set
y_pred = regressor.predict(X_test)

# Evaluate the model using the coefficient of determination (R^2 score)
r2 = r2_score(y_test, y_pred)
print("R^2 score:", r2)

# Plot actual vs. predicted weights. X_test has several features,
# so plot both against the first feature column only.
plt.scatter(X_test.iloc[:, 0], y_test, color='gray', label='Actual')
plt.scatter(X_test.iloc[:, 0], y_pred, color='red', label='Predicted')
plt.xlabel(X_test.columns[0])
plt.ylabel('Weight')
plt.legend()
plt.show()

Assignment 1 Set B Q.2

Use the iris dataset. Write a Python program to view some basic
statistical details like percentile, mean, std etc. of the species
'Iris-setosa', 'Iris-versicolor' and 'Iris-virginica'. Apply logistic
regression on the dataset to identify the different species (setosa,
versicolor, virginica) of Iris flowers given just 4 features: sepal and
petal lengths and widths. Find the accuracy of the model.

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the iris dataset


iris = load_iris()

# Convert to pandas dataframe


df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target

# Print basic statistical details of different species


print('Iris-setosa statistics:')
print(df[df['target'] == 0].describe())

print('Iris-versicolor statistics:')
print(df[df['target'] == 1].describe())

print('Iris-virginica statistics:')
print(df[df['target'] == 2].describe())

# Split data into train and test sets


X_train, X_test, y_train, y_test = train_test_split(
    df[iris.feature_names], df['target'], test_size=0.2, random_state=42)

# Fit logistic regression model


lr_model = LogisticRegression(max_iter=200, random_state=42)
lr_model.fit(X_train, y_train)

# Predict on test set


y_pred = lr_model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)
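
As a quick sanity check, the fitted model can classify a single new measurement; the values below are illustrative only:

# Classify one hypothetical flower (sepal length, sepal width, petal length, petal width in cm)
sample = pd.DataFrame([[5.1, 3.5, 1.4, 0.2]], columns=iris.feature_names)
predicted = lr_model.predict(sample)[0]
print('Predicted species:', iris.target_names[predicted])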

Assignment 2 Set B Q. 1

Download the Market Basket dataset. Write a Python program to
read the dataset and display its information. Preprocess the data
(drop null values etc.). Convert the categorical values into numeric
format. Apply the Apriori algorithm on the above dataset to
generate the frequent itemsets and association rules.

# Importing required libraries
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Reading the dataset


data = pd.read_csv('market_basket.csv')

# Displaying the dataset information


print('Dataset Information:')
print(data.info())

# Preprocessing the data


data.dropna(inplace=True)
transactions = []
for i in range(len(data)):
    # Each row is one transaction; the Market Basket file is assumed to have 20 item columns
    transactions.append([str(data.values[i, j]) for j in range(0, 20)])

# Converting categorical values into numeric format


te = TransactionEncoder()
te_ary = te.fit_transform(transactions)
df = pd.DataFrame(te_ary, columns=te.columns_)

# Applying Apriori algorithm


frequent_itemsets = apriori(df, min_support=0.01,
use_colnames=True)
rules = association_rules(frequent_itemsets, metric="lift",
min_threshold=1)

# Displaying the frequent itemsets and association rules


print('\nFrequent Itemsets:')
print(frequent_itemsets)
print('\nAssociation Rules:')
print(rules)
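
The rules DataFrame returned by mlxtend can be filtered and sorted on its metric columns; for example, to focus on the strongest rules (the thresholds here are arbitrary):

# Keep only rules with reasonably high confidence, sorted by lift
strong_rules = rules[rules['confidence'] >= 0.5].sort_values('lift', ascending=False)
print(strong_rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']].head(10))
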
Assignment 2 Set B Q. 2

Download the groceries dataset. Write a Python program to read
the dataset and display its information. Preprocess the data (drop
null values etc.). Convert the categorical values into numeric format.
Apply the Apriori algorithm on the above dataset to generate the
frequent itemsets and association rules.

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Load the dataset into a pandas dataframe


df = pd.read_csv('groceries.csv')

# Display information about the dataset


df.info()

# Preprocess the data by dropping any null values and converting
# the categorical values into numeric (one-hot) format
df.dropna(inplace=True)
transactions = df.astype(str).values.tolist()  # each row becomes one transaction (list of items)
te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)
df = pd.DataFrame(te_ary, columns=te.columns_)

# Apply the Apriori algorithm on the preprocessed dataset
frequent_itemsets = apriori(df, min_support=0.01, use_colnames=True)
# Use a new variable name so the association_rules function is not shadowed
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)

# Display the frequent itemsets and association rules in a readable format
print(frequent_itemsets)
print(rules)

Assignment 2 Set A Q.2

Create your own transactions dataset and apply the above process
on your dataset.

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [
['apple', 'banana', 'orange', 'grape'],
['apple', 'banana', 'grape'],
['apple', 'orange'],
['banana', 'orange', 'grape'],
['apple', 'banana', 'orange', 'kiwi'],
['orange', 'kiwi'],
['apple', 'banana', 'kiwi'],
['orange', 'grape', 'kiwi'],
['apple', 'orange', 'grape', 'kiwi'],
['apple', 'banana', 'orange', 'grape', 'kiwi']
]

# convert transactions to one-hot encoded format


te = TransactionEncoder()
one_hot = te.fit_transform(transactions)

# convert one-hot encoded format to dataframe


df = pd.DataFrame(one_hot, columns=te.columns_)

# generate frequent itemsets using Apriori algorithm


freq_itemsets = apriori(df, min_support=0.3, use_colnames=True)

# generate association rules


rules = association_rules(freq_itemsets, metric='confidence',
min_threshold=0.7)
# print frequent itemsets and association rules
print("Frequent Itemsets:")
print(freq_itemsets)
print("\nAssociation Rules:")
print(rules)

Assignment 3 Set A Q.1

Consider any text paragraph. Preprocess the text to remove any
special characters and digits. Generate the summary using the
extractive summarization process.

import re
import numpy as np
import nltk
from nltk.tokenize import sent_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('punkt')  # needed by sent_tokenize

# sample text paragraph
text = ("This is a sample text paragraph. It contains some special "
        "characters like % and digits like 123. The paragraph needs to be "
        "summarized using extractive summarization process.")

# tokenize into sentences first, so the sentence boundaries are preserved
sentences = sent_tokenize(text)

# preprocess each sentence by removing special characters and digits
clean_sentences = [re.sub('[^a-zA-Z]', ' ', s) for s in sentences]

# compute sentence scores using TF-IDF
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(clean_sentences)
scores = np.asarray(X.sum(axis=1)).ravel()

# select the top N sentences based on their scores
N = 2
idx = scores.argsort()[::-1][:N]
summary = [sentences[i] for i in sorted(idx)]

# print summary
print("Summary:")
for sentence in summary:
    print(sentence)

Program 8

Consider the text paragraph: "So, keep working. Keep striving. Never
give up. Fall down seven times, get up eight. Ease is a greater
threat to progress than hardship. Ease is a greater threat to
progress than hardship. So, keep moving, keep growing, keep
learning. See you at work." Preprocess the text to remove any
special characters and digits. Generate the summary using the
extractive summarization process.

import re
import nltk
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize
from nltk.probability import FreqDist
from wordcloud import WordCloud

nltk.download('punkt')
nltk.download('stopwords')

text = ("So, keep working. Keep striving. Never give up. Fall down "
        "seven times, get up eight. Ease is a greater threat to progress than "
        "hardship. Ease is a greater threat to progress than hardship. So, "
        "keep moving, keep growing, keep learning. See you at work.")

# Remove special characters and digits
processed_text = re.sub('[^A-Za-z]+', ' ', text)
print(processed_text)

# Tokenize sentences (from the original text, which still has sentence boundaries)
sentences = sent_tokenize(text)

# Remove stopwords from each sentence
stop_words = set(stopwords.words('english'))
filtered_sentences = []
for sentence in sentences:
    words = sentence.split()
    filtered_words = [word for word in words if word.lower() not in stop_words]
    filtered_sentence = ' '.join(filtered_words)
    filtered_sentences.append(filtered_sentence)

# Calculate the word frequency distribution and plot the frequencies
words = processed_text.split()
fdist = FreqDist(words)
fdist.plot()

# Generate a word cloud from the processed text
wordcloud = WordCloud(width=800, height=800,
                      background_color='white',
                      stopwords=stop_words,
                      min_font_size=10).generate(processed_text)

# plot the WordCloud image
plt.figure(figsize=(8, 8), facecolor=None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()

Assignment 3 Set A Q.2

Consider any text paragraph. Remove the stopwords. Tokenize the
paragraph to extract words and sentences. Calculate the word
frequency distribution and plot the frequencies. Plot the wordcloud
of the text.

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.probability import FreqDist
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# download the required corpora if not already downloaded
nltk.download('stopwords')
nltk.download('punkt')

# sample text paragraph
text = ("Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed "
        "do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut "
        "enim ad minim veniam, quis nostrud exercitation ullamco laboris "
        "nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in "
        "reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla "
        "pariatur.")

# tokenize the text into words


words = word_tokenize(text)

# remove stopwords
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word.casefold() not in
stop_words]

# tokenize the text into sentences


sentences = sent_tokenize(text)

# calculate the frequency distribution of the words


fdist = FreqDist(filtered_words)

# plot the frequency distribution of the words


fdist.plot()

# create a wordcloud of the most frequent words
wordcloud = WordCloud(width=800, height=800,
                      background_color='white',
                      min_font_size=10).generate(' '.join(filtered_words))

# plot the wordcloud


plt.figure(figsize=(8, 8), facecolor=None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()

Program 10

Download the movie_review.csv dataset from Kaggle using the following link:
https://www.kaggle.com/nltkdata/movie-review/version/3?select=movie_review.csv
Perform sentiment analysis on the above dataset and create a wordcloud.

import pandas as pd
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# download the required resources if not already downloaded
nltk.download('stopwords')
nltk.download('vader_lexicon')  # required by SentimentIntensityAnalyzer

# read the dataset


df = pd.read_csv('movie_review.csv')

# instantiate the SentimentIntensityAnalyzer object
sia = SentimentIntensityAnalyzer()

# apply sentiment analysis to each review and store the results in a new column
df['sentiment'] = df['review'].apply(lambda x: sia.polarity_scores(x)['compound'])

# print the number of positive, negative, and neutral reviews


print('Positive Reviews:', len(df[df['sentiment'] > 0]))
print('Negative Reviews:', len(df[df['sentiment'] < 0]))
print('Neutral Reviews:', len(df[df['sentiment'] == 0]))

# create a wordcloud of the most frequent words in the dataset


wordcloud = WordCloud(width = 800, height = 800,
background_color ='white',
stopwords = nltk.corpus.stopwords.words('english'),
min_font_size = 10).generate(' '.join(df['review']))

# plot the wordcloud


plt.figure(figsize = (8, 8), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)
plt.show()
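
A natural follow-up is to compare the vocabulary of positive and negative reviews. The sketch below reuses the 'review' column assumed above and is illustrative only:

# Word cloud built only from reviews scored as positive (compound > 0)
positive_text = ' '.join(df[df['sentiment'] > 0]['review'])
positive_cloud = WordCloud(width=800, height=800, background_color='white',
                           stopwords=set(nltk.corpus.stopwords.words('english')),
                           min_font_size=10).generate(positive_text)
plt.figure(figsize=(8, 8))
plt.imshow(positive_cloud)
plt.axis("off")
plt.show()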

Program 11

Consider the text paragraph: """Hello all, Welcome to Python
Programming Academy. Python Programming Academy is a nice
platform to learn new programming skills. It is difficult to get
enrolled in this Academy.""" Remove the stopwords.

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# download the required corpora if not already downloaded
nltk.download('stopwords')
nltk.download('punkt')

# set of english stopwords


stop_words = set(stopwords.words('english'))

# input text paragraph
text = ("Hello all, Welcome to Python Programming Academy. "
        "Python Programming Academy is a nice platform to learn new "
        "programming skills. It is difficult to get enrolled in this Academy.")

# tokenize the text paragraph into individual words


words = word_tokenize(text)

# remove the stopwords from the list of words
words_without_stopwords = [word for word in words if word.lower() not in stop_words]

# join the words back into a string


text_without_stopwords = ' '.join(words_without_stopwords)

# print the text paragraph without stopwords


print(text_without_stopwords)

Program 12

Build a simple linear regression model for User Data.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

user_data = pd.read_csv('user_data.csv')
X = user_data[['age']] # independent variable
y = user_data['income'] # dependent variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
random_state=0)

simple_lr = LinearRegression()
simple_lr.fit(X_train, y_train)

y_pred = simple_lr.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print('Mean squared error:', mse)

# Predict income for a few new ages and print the results
new_data = pd.DataFrame({'age': [35, 45, 55]})
print(simple_lr.predict(new_data))

Assignment 3 Set B Q.3

Consider the following dataset:
https://www.kaggle.com/datasets/datasnaek/youtube-new?select=INvideos.csv
Write a Python script for the following:
i. Read the dataset and perform data cleaning operations on it.
ii. Find the total views, total likes, total dislikes and comment count.

import pandas as pd

# Load the dataset


data = pd.read_csv('stats.csv')

# Drop any rows with missing values


data.dropna(inplace=True)

# Find the total views, likes, dislikes and comment count


total_views = data['views'].sum()
total_likes = data['likes'].sum()
total_dislikes = data['dislikes'].sum()
total_comments = data['comment_count'].sum()

# Print the results


print('Total views:', total_views)
print('Total likes:', total_likes)
print('Total dislikes:', total_dislikes)
print('Total comments:', total_comments)
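
The same cleaned dataframe can also be summarized per channel; the sketch below assumes the trending-videos file has a 'channel_title' column, as in the Kaggle dataset referenced above:

# Top 5 channels by total views (assumes a 'channel_title' column)
views_by_channel = data.groupby('channel_title')['views'].sum()
print(views_by_channel.sort_values(ascending=False).head(5))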

Assignment 3 Set B Q.2

Consider the following dataset:
https://www.kaggle.com/datasets/seungguini/youtube-commentsfor-covid19-relatedvideos?select=covid_2021_1.csv
Write a Python script for the following:
i. Read the dataset and perform data cleaning operations on it.
ii. Tokenize the comments in words.
iii. Perform sentiment analysis and find the percentage of positive,
negative and neutral comments.

import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# download the required resources if not already available
nltk.download('punkt')
nltk.download('vader_lexicon')

# Load the dataset


data = pd.read_csv('covid.csv')

# Drop any rows with missing values


data.dropna(inplace=True)

# Tokenize the comments in words


data['tokens'] = data['commentText'].apply(word_tokenize)

# Perform sentiment analysis


sid = SentimentIntensityAnalyzer()
data['sentiment'] = data['commentText'].apply(lambda x:
sid.polarity_scores(x)['compound'])

# Categorize the comments into positive, negative and neutral based on the
# compound score, using the common VADER cut-offs of -0.05 and 0.05
data['sentiment_category'] = pd.cut(data['sentiment'],
                                    bins=[-1, -0.05, 0.05, 1],
                                    labels=['negative', 'neutral', 'positive'],
                                    include_lowest=True)

# Calculate the percentage of comments in each sentiment category
sentiment_counts = data['sentiment_category'].value_counts(normalize=True) * 100
print('Percentage of positive comments:',
sentiment_counts['positive'])
print('Percentage of negative comments:',
sentiment_counts['negative'])
print('Percentage of neutral comments:',
sentiment_counts['neutral'])

Program 15

Build a simple linear regression model for Car Dataset.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Load the dataset


data = pd.read_csv('cars.csv')

# Split the dataset into features and target variable


X = data['mileage'].values.reshape(-1, 1)
y = data['price'].values.reshape(-1, 1)

# Split the dataset into training and testing sets with a 70:30 ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
random_state=42)

# Create a linear regression object


linreg = LinearRegression()
# Fit the training data to the model
linreg.fit(X_train, y_train)
# Make predictions on the testing data
y_pred = linreg.predict(X_test)

# Calculate the mean squared error


mse = mean_squared_error(y_test, y_pred)

# Print the mean squared error


print('Mean squared error:', mse)
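
Since this model uses a single feature, the fitted line can be visualised directly; a minimal sketch (the matplotlib import is added here because this program does not import it above):

import matplotlib.pyplot as plt

# Scatter the test data and overlay the fitted regression line (sorted by mileage)
order = X_test.ravel().argsort()
plt.scatter(X_test, y_test, color='gray', label='Actual price')
plt.plot(X_test[order], y_pred[order], color='red', linewidth=2, label='Fitted line')
plt.xlabel('Mileage')
plt.ylabel('Price')
plt.legend()
plt.show()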

Program 16

Build a logistic regression model for Student Score Dataset

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the dataset


data = pd.read_csv('student_scores.csv')

# Split the dataset into features and target variable


X = data.drop(['Pass/Fail'], axis=1)
y = data['Pass/Fail']

# Split the dataset into training and testing sets with a 70:30 ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
random_state=42)
# Create a logistic regression object
logreg = LogisticRegression()

# Fit the training data to the model


logreg.fit(X_train, y_train)
# Make predictions on the testing data
y_pred = logreg.predict(X_test)

# Calculate the accuracy score


accuracy = accuracy_score(y_test, y_pred)

# Print the accuracy score


print('Accuracy:', accuracy)
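
Beyond plain accuracy, a confusion matrix shows how the errors are distributed between the two classes; a short optional addition:

from sklearn.metrics import confusion_matrix

# Rows are the actual classes, columns are the predicted classes
print('Confusion matrix:')
print(confusion_matrix(y_test, y_pred))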

Program 17

Create the dataset: transactions = [['eggs', 'milk', 'bread'], ['eggs',
'apple'], ['milk', 'bread'], ['apple', 'milk'], ['milk', 'apple', 'bread']].
Convert the categorical values into numeric format. Apply the
apriori algorithm on the above dataset to generate the frequent
itemsets and association rules.

from sklearn.preprocessing import LabelEncoder


from apyori import apriori

transactions = [['eggs', 'milk', 'bread'],
                ['eggs', 'apple'],
                ['milk', 'bread'],
                ['apple', 'milk'],
                ['milk', 'apple', 'bread']]

# Create a LabelEncoder object


le = LabelEncoder()

# Loop through each transaction and show its label-encoded (numeric) form.
# Note: the encoded arrays are only printed here; apriori below works
# directly on the original string transactions.
for transaction in transactions:
    le.fit(transaction)
    encoded = le.transform(transaction)
    print(encoded)

# Apply the Apriori algorithm with a minimum support of 0.5


results = list(apriori(transactions, min_support=0.5))

# Print the frequent itemsets and association rules


for item in results:
print(item)
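
apyori returns RelationRecord objects, so printing them raw is hard to read; the loop below (a small formatting aid, not part of the original assignment) unpacks the support and confidence of each result:

# Print each frequent itemset with its support, plus any derived rules
for record in results:
    print('Itemset:', set(record.items), 'support:', round(record.support, 2))
    for stat in record.ordered_statistics:
        if stat.items_base:  # skip rules with an empty antecedent
            print('  Rule:', set(stat.items_base), '->', set(stat.items_add),
                  'confidence:', round(stat.confidence, 2),
                  'lift:', round(stat.lift, 2))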
