109 Sourabh Vivek Chougule
Sr. No. | Date | Lab Title
7 | — | Write a Python program to predict fruit (Apple or Orange) based on its size & weight by applying logistic regression on the 'apples_and_oranges' dataset (use 80% training data and 20% testing data). Evaluate the performance of the model using the Accuracy Score metric, Classification Report & Confusion Matrix, and AUC ROC score, and interpret the model performance.
8 | 21-Jan-22 | Write a Python program to predict fruit (Apple or Orange) based on its size & weight by applying the K-Nearest Neighbour (KNN) model on the 'apples_and_oranges' dataset (use 80% training data and 20% testing data). Evaluate the performance of the model using the Accuracy Score metric, Classification Report & Confusion Matrix, and AUC ROC score, and interpret the model performance.
9 | 14-Dec-21 | Implementing the K-means algorithm on unsupervised data of a mall that contains basic information (ID, age, gender, income, spending score) about the customers. Finding the clusters based on income and spending.
10 | 16-Dec-21 | Implementing the Agglomerative Hierarchical Clustering algorithm on unsupervised data of a mall that contains basic information (ID, age, gender, income, spending score) about the customers. Finding the clusters based on income and spending.
11 | — | Write a Python program to create an Association algorithm for supervised classification on any dataset.
12 | 11-Feb-22 | Write a Python program to predict species (Setosa, Versicolor, or Virginica) for a new iris flower based on the length & width of its petals and sepals by applying the Decision Tree model on the 'iris' dataset (use 80% training data and 20% testing data). Evaluate the performance of the model using the Accuracy Score metric, Classification Report & Confusion Matrix, and AUC ROC score, and interpret the model performance.
13 | 07-Feb-22 | Write a Python program to predict species (Setosa, Versicolor, or Virginica) for a new iris flower based on the length & width of its petals and sepals by applying the Naive Bayes classification model on the 'iris' dataset (use 80% training data and 20% testing data). Evaluate the performance of the model using the Accuracy Score metric, Classification Report & Confusion Matrix, and AUC ROC score, and interpret the model performance.
14 | 24-Jan-22 | Write a Python program to predict fruit (Apple or Orange) based on its size & weight by applying the Support Vector Machine (SVM) model on the 'apples_and_oranges' dataset (use 80% training data and 20% testing data). Evaluate the performance of the model using the Accuracy Score metric, Classification Report & Confusion Matrix, and AUC ROC score, and interpret the model performance.
15 | 28-Jan-22 | Write a Python program to predict species (Setosa, Versicolor, or Virginica) for a new iris flower based on the length & width of its petals and sepals by applying the Support Vector Machine (SVM) model on the 'iris' dataset (use 80% training data and 20% testing data). Evaluate the performance of the model using the Accuracy Score metric, Classification Report & Confusion Matrix, and AUC ROC score, and interpret the model performance.
16 | 04-Feb-22 | Write a Python program to predict whether a person will have a stroke or not, based on age & bmi, by applying the Support Vector Machine (SVM) model on the 'healthcare-dataset-stroke-data' dataset (use 80% training data and 20% testing data). Evaluate the performance of the model using the Accuracy Score metric, Classification Report & Confusion Matrix, and AUC ROC score, by tuning hyperparameters for the SVM model, and interpret the model performance.
17 | 20-Dec-22 | Python program to implement text mining basics: (i) tokenization, (ii) finding frequency distribution, (iii) removing punctuation, (iv) stemming.
18 | 09-Feb-22 | Program to implement text mining: sentiment analysis, using an RNN (LSTM) learning model on a dataset of tweets about airline management.
19 | 21-Jan-22 | Implementing Python visualizations on cluster data.
Q1) Write a Python program to find the correlation matrix.
Ans: -
import numpy as np

# sample arrays (the journal does not show them; these values are assumed,
# chosen so the snippet reproduces the recorded output below)
x = np.array([15, 18, 21, 24, 27])
y = np.array([25, 25, 27, 31, 32])

# compute and print the Pearson correlation matrix
matrix = np.corrcoef(x, y)
print(matrix)
OUTPUT:-
[[1. 0.95750662]
[0.95750662 1. ]]
Q2) Plot the correlation plot on a dataset, visualizing an overview of the relationships in the iris data.
Ans: -
# assumed setup (not shown in the journal): imports and loading the Iris CSV
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
data = pd.read_csv("Iris.csv")

# visualizing the correlation coefficient between sepal length and width
np.corrcoef(data['SepalLengthCm'], data['SepalWidthCm'])
# Plotting a heatmap; the colour indicates the degree of correlation
sns.heatmap(data.corr())
plt.figure(figsize=(15, 8))
plt.subplot(231)
sns.scatterplot(x=data['PetalLengthCm'], y=data['SepalLengthCm'])
plt.subplot(232)
sns.scatterplot(x=data['PetalLengthCm'], y=data['SepalWidthCm'])
plt.subplot(233)
sns.scatterplot(x=data['PetalLengthCm'], y=data['PetalWidthCm'])
plt.subplot(234)
sns.scatterplot(x=data['PetalWidthCm'], y=data['SepalLengthCm'])
plt.subplot(235)
sns.scatterplot(x=data['PetalWidthCm'], y=data['SepalWidthCm'])
plt.subplot(236)
sns.scatterplot(x=data['PetalWidthCm'], y=data['PetalLengthCm'])
plt.show()
# Spearman's coefficient
from scipy.stats import spearmanr
spearmanr(data['SepalLengthCm'], data['SepalWidthCm'])
Q3) Implementing ANOVA testing on the iris dataset, using one categorical independent variable, Species (iris-setosa, iris-versicolor, iris-virginica), and sepal width as the continuous variable.
Ans: -
# assumed setup (not shown in the journal): iris loaded from scikit-learn
import pandas as pd
from scipy import stats
from sklearn.datasets import load_iris
from statsmodels.graphics.factorplots import interaction_plot

df = load_iris()
dataframe_iris = pd.DataFrame(df.data, columns=['sepalLength', 'sepalWidth', 'petalLength', 'petalWidth'])
# target column (assumed; the journal concatenates a dataframe_iris1 holding the species codes)
dataframe_iris1 = pd.DataFrame(df.target, columns=['target'])

# an ID column with one entry per observation
ID = []
for i in range(0, 150):
    ID.append(i)
dataframe = pd.DataFrame(ID, columns=['ID'])

dataframe_iris_new = pd.concat([dataframe_iris, dataframe_iris1, dataframe], axis=1)
dataframe_iris_new.columns

fig = interaction_plot(dataframe_iris_new.sepalWidth, dataframe_iris_new.target,
                       dataframe_iris_new.ID, colors=['red', 'blue', 'green'], ms=12)

dataframe_iris_new.info()
dataframe_iris_new.describe()
# To implement the ANOVA test we first state a null and an alternate hypothesis
# Null hypothesis: the sample means are equal
# Alternate hypothesis: the sample means are not equal
print(dataframe_iris_new['sepalWidth'].groupby(dataframe_iris_new['target']).mean())
dataframe_iris_new.mean()

# Shapiro-Wilk normality test
stats.shapiro(dataframe_iris_new['sepalWidth'][dataframe_iris_new['target']])
OUTPUT:
(0.7824662327766418, 1.1907719276761652e-13)
Interpretation: As the p-value is significant, we reject the null hypothesis (here, the assumption of normality).
# Levene's test for equality of variances
p_value = stats.levene(dataframe_iris_new['sepalWidth'], dataframe_iris_new['target'])
p_value
OUTPUT:
LeveneResult(statistic=55.1738582824089, pvalue=1.1695737027924642e-12)
Interpretation: As the p-value is significant, we reject the null hypothesis (here, equality of variances).

# One-way ANOVA
F_value, P_value = stats.f_oneway(dataframe_iris_new['sepalWidth'], dataframe_iris_new['target'])
print("F_value=", F_value, ",", "P_value=", P_value)
OUTPUT:
Interpretation: As the F-value is much greater than 1.0, the samples have different means; we reject the null hypothesis.
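The f_oneway call above compares the sepalWidth column against the numeric target codes themselves; the more conventional one-way ANOVA compares sepal width across the three species groups. A minimal sketch of that variant, assuming the same dataframe_iris_new as above:

setosa = dataframe_iris_new[dataframe_iris_new['target'] == 0]['sepalWidth']
versicolor = dataframe_iris_new[dataframe_iris_new['target'] == 1]['sepalWidth']
virginica = dataframe_iris_new[dataframe_iris_new['target'] == 2]['sepalWidth']
F, p = stats.f_oneway(setosa, versicolor, virginica)
print("F =", F, ", p =", p)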
Q4) Write a Python program to predict mpg (miles per gallon) for a car based on variable wt by
applying simple linear regression on 'mtcars' dataset (Use Training data 80% and Testing Data
20%).
Record the performance of the model in terms of MAE, MSE, RMSE and R-squared value.
Change the training data to 70% and the testing data to 30%, then compare and interpret the performance of your model.
Ans: -
1st part -
# assumed setup (not shown in full in the journal): Colab drive mount plus
# scikit-learn imports and the train-test split
from google.colab import drive
drive.mount("/content/drive", force_remount=True)
import numpy as np
import pandas as pd
from math import sqrt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

df = pd.read_csv("/content/drive/MyDrive/Dataset/mtcars.csv")
print(df.head(5))
x = df.iloc[:, [6]].values   # wt
y = df.iloc[:, 1].values     # mpg
print(x)

# 80/20 split (random_state assumed)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
print(x_test)

LinearRegressor = LinearRegression()
LinearRegressor.fit(x_train, y_train)
y_pred = LinearRegressor.predict(x_test)
print('Mean Absolute Error: %.2f' % mean_absolute_error(y_test, y_pred))
print('Mean Squared Error: %.2f' % mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error: %.2f' % sqrt(mean_squared_error(y_test, y_pred)))
print('R2-score: %.2f' % r2_score(y_test, y_pred))
Output -----
R2-score: 0.81
2nd part -
# same pipeline with a 70/30 split
df = pd.read_csv("/content/drive/MyDrive/Dataset/mtcars.csv")
x = df.iloc[:, [6]].values
y = df.iloc[:, 1].values

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

LinearRegressor = LinearRegression()
LinearRegressor.fit(x_train, y_train)
y_pred = LinearRegressor.predict(x_test)
print('R2-score: %.2f' % r2_score(y_test, y_pred))
Output: ----
R2-score: 0.84
Interpretation: On this data the 70/30 split gives a slightly higher R-squared (0.84 vs 0.81); with only 32 rows, though, the difference is sensitive to which rows land in the test set.
Q5) Write a Python program to predict mpg (miles per gallon) for a car based on variables wt, cyl &
disp by applying multi-linear regression on 'mtcars' dataset (Use Training data 80% and Testing
Data 20%).
Record the performance of the model in terms of MAE, MSE, RMSE and R-squared value.
Remove the variable disp from the feature set and check the performance again. Compare and interpret the performance of your model.
Ans: -
1st part -
# setup as in Q4 (drive mounted; imports and split assumed)
import numpy as np
import pandas as pd
df = pd.read_csv("/content/drive/MyDrive/Dataset/mtcars.csv")
print(df.head(5))
x = df.iloc[:, [2, 3, 6]].values   # cyl, disp, wt
y = df.iloc[:, 1].values           # mpg
print(x)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
print(x_test)

LinearRegressor = LinearRegression()
LinearRegressor.fit(x_train, y_train)
y_pred = LinearRegressor.predict(x_test)
print('Mean Absolute Error: %.2f' % mean_absolute_error(y_test, y_pred))
print('Mean Squared Error: %.2f' % mean_squared_error(y_test, y_pred))
print('R2-score: %.2f' % r2_score(y_test, y_pred))
Output -----
Mean Absolute Error: 2.71
2nd part -
from google.colab import drive
drive.mount("/content/drive", force_remount=True)
import numpy as np
import pandas as pd
df = pd.read_csv("/content/drive/MyDrive/Dataset/mtcars.csv")
print(df.head(5))
x = df.iloc[:, [2, 6]].values   # cyl, wt (disp removed)
y = df.iloc[:, 1].values
print(x)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

LinearRegressor = LinearRegression()
LinearRegressor.fit(x_train, y_train)
y_pred = LinearRegressor.predict(x_test)
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
print('Mean Absolute Error: %.2f' % mean_absolute_error(y_test, y_pred))
print('R2-score: %.2f' % r2_score(y_test, y_pred))
Output: ----
R2-score: 0.81
Interpretation: Removing disp leaves the R-squared essentially unchanged at 0.81, suggesting disp adds little information beyond wt and cyl on this split.
Q6) Write a Python program to predict mpg (miles per gallon) for a car based on variables wt, cyl &
disp by applying multi-linear regression on 'mtcars' dataset (Use Training data 80% and Testing
Data 20%).
Record the performance of the model in terms of MAE, MSE, RMSE and R-squared value.
Replace disp by the drat variable in the feature set and check the performance again. Interpret the performance of your model.
Ans: -
1st part -
# setup as in Q4/Q5 (imports and split assumed)
from google.colab import drive
drive.mount("/content/drive", force_remount=True)
import numpy as np
import pandas as pd
df = pd.read_csv("/content/drive/MyDrive/Dataset/mtcars.csv")
print(df.head(5))
x = df.iloc[:, [2, 3, 6]].values   # cyl, disp, wt
y = df.iloc[:, 1].values
print(x)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
print(x_test)

LinearRegressor = LinearRegression()
LinearRegressor.fit(x_train, y_train)
y_pred = LinearRegressor.predict(x_test)
print('R2-score: %.2f' % r2_score(y_test,y_pred))
Output-----
R2-score: 0.81
2nd part -
drive.mount("/content/drive", force_remount=True)
import numpy as np
import pandas as pd
df = pd.read_csv("/content/drive/MyDrive/Dataset/mtcars.csv")
print(df.head(5))
x = df.iloc[:, [2, 5, 6]].values   # cyl, drat, wt (disp replaced by drat)
y = df.iloc[:, 1].values
print(x)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
print(x_test)

LinearRegressor = LinearRegression()
LinearRegressor.fit(x_train, y_train)
y_pred = LinearRegressor.predict(x_test)
print('Root Mean Squared Error: %.2f' % sqrt(mean_squared_error(y_test, y_pred)))
print('R2-score: %.2f' % r2_score(y_test, y_pred))
Output: ----
R2-score: 0.81
Interpretation: Replacing disp with drat leaves the R-squared at 0.81 on this split; neither variable adds much beyond wt and cyl.
Q7) Write a Python program to predict fruit (Apple or Orange) based on its size & weight by applying
logistic regression on 'apples_and_oranges' dataset (Use Training data 80% and Testing Data
20%).
Evaluate the performance of the model using Accuracy Score metric, Classification Report &
Confusion Matrix, AUC ROC score for the model and interpret the model performance.
Ans: -
1st part -
# assumed setup: Colab drive mount plus scikit-learn imports and the split
from google.colab import drive
drive.mount("/content/drive", force_remount=True)
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("/content/drive/MyDrive/Dataset/apples_and_oranges.csv")
print(df.head(5))
x = df.iloc[:, 0:2].values
y = df.iloc[:, 2].values
print(x)

# random_state can be any fixed integer (2, 43, 45, ...); it selects which
# random combination of rows is allotted to training and testing
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=2)
print(x_test)

LogRegressor = LogisticRegression()
LogRegressor.fit(x_train, y_train)
y_pred = LogRegressor.predict(x_test)
# y_pred = LogRegressor.predict([[70, 5.30]])
print(y_pred)
Output -----
2nd part -
# evaluating the fitted model (the evaluation code is reconstructed here
# from the metric names in the output below)
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_auc_score
print("Accuracy_Score", accuracy_score(y_test, y_pred))
print("Confusion_matrix\n", confusion_matrix(y_test, y_pred))
print("Classification_Report\n", classification_report(y_test, y_pred))
#print("Roc_Auc_Score\n", roc_auc_score(y_test, y_pred))
Output: ----
Accuracy_Score 1.0
Confusion_matrix
[[2 0]
[0 6]]
Classification_Report
accuracy 1.00 8
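Since the class labels in this dataset are strings, roc_auc_score needs numeric labels first; a hedged sketch, using the same y_test and y_pred as above:

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
print("Roc_Auc_Score", roc_auc_score(le.fit_transform(y_test), le.transform(y_pred)))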
-------------------------------------------------------------------------------------------------------------------------------------------------------
Q8) Write a Python program to predict fruit (Apple or Orange) based on its size & weight by
applying K-Nearest Neighbour (KNN) model on 'apples_and_oranges' dataset (Use Training data
80% and Testing Data 20%).
Evaluate the performance of the model using Accuracy Score metric, Classification Report &
Confusion Matrix, AUC ROC score for the model and interpret the model performance.
Ans: -
# assumed setup: drive mounted; scikit-learn imports and the split
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

df = pd.read_csv("/content/drive/My Drive/Dataset/apples_and_oranges.csv")
x = df.iloc[:, 0:2].values
y = df.iloc[:, 2].values
df.head(5)

# 80/20 split (random_state assumed)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(x_train, y_train)
y_pred = knn.predict(x_test)
# y_pred = knn.predict([[w, s]])  # single prediction from weight w and size s
#Evaluating KNN Model
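The evaluation code itself is not reproduced in the journal; a minimal sketch consistent with the output below:

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
print("Accuracy Score :", accuracy_score(y_test, y_pred))
print("Classification Report :\n", classification_report(y_test, y_pred))
print("Confusion Matrix :\n", confusion_matrix(y_test, y_pred))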
Output: -
Classification Report :
accuracy 1.00 8
Confusion Matrix:
[[4 0]
[0 4]]
Q9) Implementing the K-means algorithm on unsupervised data of a mall that contains basic information (ID, age, gender, income, spending score) about the customers. Finding the clusters based on income and spending.
Ans: -
# importing the necessary libraries
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import os
import warnings
warnings.filterwarnings('ignore')

# loading the mall customers dataset (path assumed)
df = pd.read_csv('Mall_Customers.csv')
df.head()
OUTPUT:
CustomerID Gender Age Annual Income (k$) Spending Score (1-100)
1 Male 19 15 39
2 Male 21 15 81
3 Female 20 16 6
4 Female 23 16 77
5 Female 31 17 40
# elbow method: inertia for k = 1..10 (the loop is reconstructed; only the
# annotation call survived in the journal)
X = df[['Annual Income (k$)', 'Spending Score (1-100)']].copy()
X.columns = ['Income', 'Score']
clusters = []
for k in range(1, 11):
    clusters.append(KMeans(n_clusters=k).fit(X).inertia_)

fig, ax = plt.subplots(figsize=(12, 8))
sns.lineplot(x=list(range(1, 11)), y=clusters, ax=ax)

# Annotate arrow
ax.annotate('Possible Elbow Point', xy=(3, 140000), xytext=(3, 50000),
            xycoords='data',
            arrowprops=dict(arrowstyle='->', connectionstyle='arc3',
                            color='blue', lw=2))
plt.show()
km5 = KMeans(n_clusters=5).fit(X)
X['Labels'] = km5.labels_
plt.figure(figsize=(12, 8))
sns.scatterplot(x=X['Income'], y=X['Score'], hue=X['Labels'],
                palette=sns.color_palette('hls', 5))
plt.title('KMeans with 5 Clusters')
plt.show()
Output:
Q10) Implementing the Agglomerative Hierarchical Clustering algorithm on unsupervised data of a mall that contains basic information (ID, age, gender, income, spending score) about the customers. Finding the clusters based on income and spending.
Ans: -
# importing the necessary libraries
from sklearn.preprocessing import StandardScaler
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import os
import warnings
warnings.filterwarnings('ignore')

# loading the dataset and selecting the two features (assumed, as in Q9)
df = pd.read_csv('Mall_Customers.csv')
X = df[['Annual Income (k$)', 'Spending Score (1-100)']].copy()
X.columns = ['Income', 'Score']
df.head()
OUTPUT:
CustomerID Gender Age Annual Income (k$) Spending Score (1-100)
1 Male 19 15 39
2 Male 21 15 81
3 Female 20 16 6
4 Female 23 16 77
5 Female 31 17 40
from sklearn.cluster import AgglomerativeClustering

agglom = AgglomerativeClustering(n_clusters=5, linkage='average').fit(X)
X['Labels'] = agglom.labels_
plt.figure(figsize=(12, 8))
sns.scatterplot(x=X['Income'], y=X['Score'], hue=X['Labels'],
                palette=sns.color_palette('hls', 5))
plt.title('Agglomerative with 5 Clusters')
plt.show()
OUTPUT
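A dendrogram is the usual way to inspect the merge structure behind hierarchical clustering; a short SciPy sketch (not part of the original journal; column names as above):

from scipy.cluster.hierarchy import dendrogram, linkage
Z = linkage(X[['Income', 'Score']], method='average')
plt.figure(figsize=(12, 6))
dendrogram(Z, truncate_mode='lastp', p=20)
plt.title('Dendrogram (average linkage)')
plt.show()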
Q11) Write a Python program to create an Association algorithm for supervised classification on any dataset.
Ans: -
# assumed imports for the Apriori workflow (mlxtend is the usual library)
import numpy as np
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# change to the working directory holding the dataset (Jupyter magic)
cd C:\Users\Dev\Desktop\Kaggle\Apriori Algorithm

data = pd.read_excel('Online_Retail.xlsx')
data.head()
data.columns
data.Country.unique()

# stripping extra spaces in the description
data['Description'] = data['Description'].str.strip()
# dropping credit transactions (invoice numbers containing 'C')
data['InvoiceNo'] = data['InvoiceNo'].astype('str')
data = data[~data['InvoiceNo'].str.contains('C')]

# building one basket (invoice x item quantity) per country; the journal
# repeats the same chain for France, the UK, Portugal and Sweden, but the
# heads of the four chains were lost, so the filters are reconstructed
basket_France = (data[data['Country'] == "France"]
                 .groupby(['InvoiceNo', 'Description'])['Quantity']
                 .sum().unstack().reset_index().fillna(0)
                 .set_index('InvoiceNo'))
basket_UK = (data[data['Country'] == "United Kingdom"]
             .groupby(['InvoiceNo', 'Description'])['Quantity']
             .sum().unstack().reset_index().fillna(0)
             .set_index('InvoiceNo'))
basket_Por = (data[data['Country'] == "Portugal"]
              .groupby(['InvoiceNo', 'Description'])['Quantity']
              .sum().unstack().reset_index().fillna(0)
              .set_index('InvoiceNo'))
basket_Sweden = (data[data['Country'] == "Sweden"]
                 .groupby(['InvoiceNo', 'Description'])['Quantity']
                 .sum().unstack().reset_index().fillna(0)
                 .set_index('InvoiceNo'))
# one-hot encoding the quantities
def hot_encode(x):
    if x <= 0:
        return 0
    if x >= 1:
        return 1

basket_encoded = basket_France.applymap(hot_encode)
basket_France = basket_encoded
basket_encoded = basket_UK.applymap(hot_encode)
basket_UK = basket_encoded
basket_encoded = basket_Por.applymap(hot_encode)
basket_Por = basket_encoded
basket_encoded = basket_Sweden.applymap(hot_encode)
basket_Sweden = basket_encoded
# Building the model (shown for the French baskets; the journal repeats
# the same print for each country; the min_support value is assumed)
frq_items = apriori(basket_France, min_support=0.05, use_colnames=True)
rules = association_rules(frq_items, metric="lift", min_threshold=1)
rules = rules.sort_values(['confidence', 'lift'], ascending=[False, False])
print(rules.head())
Q12) Write a Python program to predict species (Setosa, Versicolor, or Virginica) for a new iris flower based on the length & width of its petals and sepals by applying the Decision Tree model on the 'iris' dataset (use 80% training data and 20% testing data).
Evaluate the performance of the model using the Accuracy Score metric, Classification Report & Confusion Matrix, AUC ROC score for the model and interpret the model performance.
Ans: -
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("/content/Iris.csv")
x = df.iloc[:, 1:5].values   # sepal & petal length and width
y = df.iloc[:, 5].values     # species
df.head(5)

# 80/20 split (random_state assumed)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

dt = DecisionTreeClassifier()
dt.fit(x_train, y_train)
y_pred = dt.predict(x_test)
# predicting a single new flower (the four measurements are user-supplied values)
y_pred1 = dt.predict([[sepal_length, sepal_width, petal_length, petal_width]])
Output: -
accuracy 0.97 30
macro avg 0.98 0.94 0.96 30
weighted avg 0.97 0.97 0.97 30
Confusion Matrix :
[[11 0 0]
[ 0 5 1]
[ 0 0 13]]
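The question also asks for the AUC ROC score; for a three-class problem this needs predicted probabilities and a one-vs-rest average. A hedged sketch using the dt model above:

from sklearn.metrics import roc_auc_score
y_proba = dt.predict_proba(x_test)
print("AUC-ROC (OvR, macro):", roc_auc_score(y_test, y_proba, multi_class='ovr'))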
Q13) Write a Python program to predict species (Setosa, Versicolor, or Virginica) for a new iris flower based on the length & width of its petals and sepals by applying the Naive Bayes classification model on the 'iris' dataset (use 80% training data and 20% testing data).
Evaluate the performance of the model using the Accuracy Score metric, Classification Report & Confusion Matrix, AUC ROC score for the model and interpret the model performance.
Ans: -
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

df = pd.read_csv("/content/drive/My Drive/Dataset/Iris.csv")
x = df.iloc[:, 1:5].values
y = df.iloc[:, 5].values
df.head(5)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

nb = GaussianNB()
nb.fit(x_train, y_train)
y_pred = nb.predict(x_test)
print("The flower belongs to ", y_pred1)
Ouptut:-
Classification Report :
accuracy 0.97 30
[ 0 5 1]
[ 0 0 13]]
Q14) Write a Python program to predict fruit (Apple or Orange) based on its size & weight by
applying Support Vector Machine (SVM) model on 'apples_and_oranges' dataset (Use Training data
80% and Testing Data 20%).
Evaluate the performance of the model using Accuracy Score metric, Classification Report &
Confusion Matrix, AUC ROC score for the model and interpret the model performance.
Ans: -
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

df = pd.read_csv("/content/apples_and_oranges.csv")
x = df.iloc[:, 0:2].values
y = df.iloc[:, 2].values
df.head(5)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

svc = SVC()
svc.fit(x_train, y_train)
y_pred = svc.predict(x_test)
# y_pred1 = svc.predict([[w, s]])  # single prediction from weight w and size s
#Evaluating SVM Model
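As in Q8, the evaluation code is not reproduced in the journal; a minimal sketch consistent with the output below:

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
print("Accuracy Score :", accuracy_score(y_test, y_pred))
print("Classification Report :\n", classification_report(y_test, y_pred))
print("Confusion Matrix :\n", confusion_matrix(y_test, y_pred))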
Output:-
accuracy 0.88 8
macro avg 0.90 0.88 0.87 8
weighted avg 0.90 0.88 0.87 8
Confusion Matrix :
[[4 0]
[1 3]]
Q15) Write a Python program to predict species (Setosa, Versicolor, or Virginica) for a new iris flower based on the length & width of its petals and sepals by applying the Support Vector Machine (SVM) model on the 'iris' dataset (use 80% training data and 20% testing data).
Evaluate the performance of the model using Accuracy Score metric, Classification Report &
Confusion Matrix, AUC ROC score for the model and interpret the model performance.
Ans: -
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

df = pd.read_csv("/content/Iris.csv")
x = df.iloc[:, 1:5].values   # sepal & petal length and width
y = df.iloc[:, 5].values     # species
df.head(5)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

svm = SVC(kernel="linear")   # other kernels: poly, sigmoid, rbf
svm.fit(x_train, y_train)
y_pred = svm.predict(x_test)
# y_pred1 = svm.predict([[sl, sw, pl, pw]])  # single prediction from the four measurements
Output: -
accuracy 0.88 8
macro avg 0.90 0.88 0.87 8
weighted avg 0.90 0.88 0.87 8
Confusion Matrix :
[[4 0]
[1 3]]
Q16) Write a Python program to predict whether a person will have stroke or not, based on age &
bmi by applying Support Vector Machine (SVM) model on 'healthcare-dataset-stroke-data' dataset
(Use Training data 80% and Testing Data 20%).
Evaluate the performance of the model using the Accuracy Score metric, Classification Report & Confusion Matrix, AUC ROC score for the model by tuning hyperparameters for the SVM model, and interpret the model performance.
Ans: -
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

df = pd.read_csv("/content/healthcare-dataset-stroke-data.csv")
# the bmi column contains missing values, which SVC cannot handle (step
# assumed; the test-set size in the output below implies rows were dropped)
df = df.dropna()
x = df.iloc[:, [2, 9]].values   # age, bmi
y = df.iloc[:, 11].values       # stroke
df.head(5)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

svc = SVC(kernel="linear")
svc.fit(x_train, y_train)
y_pred = svc.predict(x_test)

# predicting for a single person (age and bmi are user-supplied values)
y_pred1 = svc.predict([[age, bmi]])
if y_pred1 == 1:
    print("The person will have stroke")
else:
    print("The person will not have stroke")
Output:-
Confusion Matrix :
[[948 0]
[ 34 0]]
/usr/local/lib/python3.7/dist-packages/sklearn/metrics/_classification.py:1318: UndefinedMetricWarning:
Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use
`zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
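Interpretation: the confusion matrix shows the model predicting "no stroke" for every test sample (the classes are heavily imbalanced), which is what triggers the warning above. The journal does not reproduce the tuning step the question asks for; a hedged GridSearchCV sketch (parameter grid assumed):

from sklearn.model_selection import GridSearchCV
param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf'], 'gamma': ['scale', 'auto']}
grid = GridSearchCV(SVC(class_weight='balanced'), param_grid, cv=5, scoring='roc_auc')
grid.fit(x_train, y_train)
print(grid.best_params_, grid.best_score_)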
Q17) Python Program to implement Text Mining Basics:
i. Tokenization
ii. Finding frequency distinct
iii. Removing punctuations
iv. Stemming
Ans: -
# Tokenization
# Importing necessary library
import pandas as pd
import numpy as np
import nltk
import os
import nltk.corpus
# sample text for performing tokenization
text = "We are learning text mining basics with python. python will help in implementing different algorithms"

# importing word_tokenize from nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')   # tokenizer models (download assumed; needed on first run)

# Passing the string text into word_tokenize for breaking up the sentences
token = word_tokenize(text)
token
Output:
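The frequency-distribution step (item ii) is not shown in the journal; a minimal sketch using NLTK's FreqDist on the tokens above:

# finding the frequency of distinct tokens
from nltk.probability import FreqDist
fdist = FreqDist(token)
print(fdist.most_common(10))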
# remove punctuation
import string

text = "Thank you! For learning. Just adding, a few notes, diagrams and ppts."
punct = set(string.punctuation)
text = "".join([ch for ch in text if ch not in punct])
print(text)
Output:
Thank you For learning Just adding a few notes diagrams and ppts
# Stemming
import nltk
from nltk.stem.porter import PorterStemmer

words = ["walk", "walking", "walked", "walks", "ran", "run", "running", "runs"]
stemmer = PorterStemmer()
for word in words:
    print(word, "--->", stemmer.stem(word))
Output:
walk ---> walk
walking ---> walk
walked ---> walk
walks ---> walk
ran ---> ran
run ---> run
running ---> run
runs ---> run
Q18) Program to implement Text Mining: Sentiment Analysis, using an RNN (LSTM) learning model on a dataset of tweets about airline management.
Ans: -
Ans: -
import pandas as pd
print(review_df.shape)
review_df.head(5)
sentiment_label = review_df.airline_sentiment.factorize()
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(tweet)
encoded_docs = tokenizer.texts_to_sequences(tweet)
embedding_vector_length = 32
model = Sequential()
model.add(Embedding(vocab_size, embedding_vector_length, input_length=200))
model.add(SpatialDropout1D(0.25))
model.add(LSTM(50, dropout=0.5, recurrent_dropout=0.5))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',optimizer='adam',
metrics=['accuracy'])
print(model.summary())
# Train the sentiment analysis model for 5 epochs on the whole dataset with a batch size of 32 and a validation split of 20%.
history = model.fit(padded_sequence, sentiment_label[0], validation_split=0.2, epochs=5, batch_size=32)
Output:
def predict_sentiment(text):
    tw = tokenizer.texts_to_sequences([text])
    tw = pad_sequences(tw, maxlen=200)
    prediction = int(model.predict(tw).round().item())
    print("Predicted label: ", sentiment_label[1][prediction])

# example call (test sentence assumed)
predict_sentiment("I enjoyed my flight, the crew was great.")
Output:
Q19) Implementing Python visualizations on cluster data
Ans: -
# Import pandas and the CSV file I/O library
import pandas as pd
# Import seaborn, a Python graphing library
import warnings
warnings.filterwarnings("ignore")
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="white", color_codes=True)

# loading the iris dataset (path assumed)
iris = pd.read_csv("Iris.csv")
# Using the seaborn library to make a plot
sns.jointplot(x="SepalLengthCm", y="SepalWidthCm", data=iris, height=5)
Output:
Q20) Creating & visualizing a simple ANN problem to understand the implementation of an artificial neuron using Python
Ans: -
Training Data:
Input 1 Input 2 Input 3 Output
0 1 1 1
1 0 0 0
1 0 1 1
Test Data:
1 0 1 ?
Solution Program
import numpy as np

class NeuralNetwork():
    def __init__(self):
        # seeding for random number generation
        np.random.seed(1)
        # a single neuron with 3 input weights in [-1, 1) (reconstructed;
        # the journal shows only fragments of this class)
        self.synaptic_weights = 2 * np.random.random((3, 1)) - 1

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-x))

    def train(self, training_inputs, training_outputs, iterations):
        for _ in range(iterations):
            output = self.think(training_inputs)
            error = training_outputs - output
            # weight update scaled by the sigmoid derivative
            adjustments = np.dot(training_inputs.T, error * (output * (1 - output)))
            self.synaptic_weights += adjustments

    def think(self, inputs):
        inputs = inputs.astype(float)
        output = self.sigmoid(np.dot(inputs, self.synaptic_weights))
        return output

if __name__ == "__main__":
    neural_network = NeuralNetwork()
    # training inputs assumed from the classic single-neuron example;
    # the journal shows only the outputs array
    training_inputs = np.array([[0, 0, 1], [1, 1, 1], [1, 0, 1], [0, 1, 1]])
    training_outputs = np.array([[0, 1, 1, 0]]).T
    neural_network.train(training_inputs, training_outputs, 10000)
    print(neural_network.think(np.array([1, 0, 1])))
OUTPUT:
Q21) Program to pre-process data of Australian weather and implement an Artificial Neural Network to predict the weather
Ans: -
# assumed setup: imports and loading the Australian weather dataset
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

np.random.seed(0)
data = pd.read_csv("weatherAUS.csv")   # path assumed
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 145460 entries, 0 to 145459
Data columns (total 23 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 145460 non-null object
1 Location 145460 non-null object
2 MinTemp 143975 non-null float64
3 MaxTemp 144199 non-null float64
4 Rainfall 142199 non-null float64
5 Evaporation 82670 non-null float64
6 Sunshine 75625 non-null float64
7 WindGustDir 135134 non-null object
8 WindGustSpeed 135197 non-null float64
9 WindDir9am 134894 non-null object
10 WindDir3pm 141232 non-null object
11 WindSpeed9am 143693 non-null float64
12 WindSpeed3pm 142398 non-null float64
13 Humidity9am 142806 non-null float64
14 Humidity3pm 140953 non-null float64
15 Pressure9am 130395 non-null float64
16 Pressure3pm 130432 non-null float64
17 Cloud9am 89572 non-null float64
18 Cloud3pm 86102 non-null float64
19 Temp9am 143693 non-null float64
20 Temp3pm 141851 non-null float64
21 RainToday 142199 non-null object
22 RainTomorrow 142193 non-null object
dtypes: float64(16), object(7)
memory usage: 25.5+ MB
#Parsing datetime
# exploring the length of date objects
lengths = data["Date"].str.len()
lengths.value_counts()

#There don't seem to be any errors in the dates, so parsing the values into datetime
data['Date'] = pd.to_datetime(data["Date"])

#Creating columns for year, month and day; month and day are cyclically
#encoded with the encode() helper (shown after this block)
data['year'] = data.Date.dt.year
data['month'] = data.Date.dt.month
data = encode(data, 'month', 12)
data['day'] = data.Date.dt.day
data = encode(data, 'day', 31)
data.head()
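The encode() helper used above is not reproduced in the journal; a sketch of the sine/cosine cyclical encoding it implies, consistent with the month_sin/month_cos columns that appear later:

def encode(data, col, max_val):
    # map a cyclic feature (month 1-12, day 1-31) onto the unit circle
    data[col + '_sin'] = np.sin(2 * np.pi * data[col] / max_val)
    data[col + '_cos'] = np.cos(2 * np.pi * data[col] / max_val)
    return data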
# Splitting months and days into a sine and cosine combination provides a cyclical continuous feature, which can be used as an input feature to the ANN.
# Splitting of Month
cyclic_month = sns.scatterplot(x="month_sin", y="month_cos", data=data, color="#C2C4E2")
cyclic_month.set_title("Cyclic Encoding of Month")
cyclic_month.set_ylabel("Cosine Encoded Months")
cyclic_month.set_xlabel("Sine Encoded Months")

# Splitting of Day
cyclic_day = sns.scatterplot(x='day_sin', y='day_cos', data=data, color="#C2C4E2")
cyclic_day.set_title("Cyclic Encoding of Day")
cyclic_day.set_ylabel("Cosine Encoded Day")
cyclic_day.set_xlabel("Sine Encoded Day")
# Processing the data for missing values
# lists of categorical and numeric column names (reconstructed; the journal
# uses object_cols and num_cols without defining them)
object_cols = [col for col in data.columns if data[col].dtype == "object"]
num_cols = [col for col in data.columns if data[col].dtype == "float64"]

# Filling missing values with the mode of the column, for categorical variables
for i in object_cols:
    data[i].fillna(data[i].mode()[0], inplace=True)

# Filling missing values with the median of the column, for numerical variables
for i in num_cols:
    data[i].fillna(data[i].median(), inplace=True)

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 145460 entries, 0 to 145459
Data columns (total 30 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 145460 non-null datetime64[ns]
1 Location 145460 non-null object
2 MinTemp 145460 non-null float64
3 MaxTemp 145460 non-null float64
4 Rainfall 145460 non-null float64
5 Evaporation 145460 non-null float64
6 Sunshine 145460 non-null float64
7 WindGustDir 145460 non-null object
8 WindGustSpeed 145460 non-null float64
9 WindDir9am 145460 non-null object
10 WindDir3pm 145460 non-null object
11 WindSpeed9am 145460 non-null float64
12 WindSpeed3pm 145460 non-null float64
13 Humidity9am 145460 non-null float64
14 Humidity3pm 145460 non-null float64
15 Pressure9am 145460 non-null float64
16 Pressure3pm 145460 non-null float64
17 Cloud9am 145460 non-null float64
18 Cloud3pm 145460 non-null float64
19 Temp9am 145460 non-null float64
20 Temp3pm 145460 non-null float64
21 RainToday 145460 non-null object
22 RainTomorrow 145460 non-null object
23 year 145460 non-null int64
24 month 145460 non-null int64
25 month_sin 145460 non-null float64
26 month_cos 145460 non-null float64
27 day 145460 non-null int64
28 day_sin 145460 non-null float64
29 day_cos 145460 non-null float64
dtypes: datetime64[ns](1), float64(20), int64(3), object(6)
memory usage: 33.3+ MB
target = data['RainTomorrow']
# (the label-encoding and scaling steps that produce the 'features' table
# are not reproduced in the journal)
features.describe().T

# Creating Model
# Assigning X and y the status of attributes and tags
X = features.drop(["RainTomorrow"], axis=1)
y = features["RainTomorrow"]
X.shape
# assumed Keras imports
from tensorflow.keras import callbacks
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

#Early stopping
early_stopping = callbacks.EarlyStopping(
    min_delta=0.001,   # minimum amount of change to count as an improvement
    patience=20,       # how many epochs to wait before stopping
    restore_best_weights=True,
)

# Initialising the NN
model = Sequential()
# layers
model.add(Dense(units=32, kernel_initializer='uniform', activation='relu'))
model.add(Dense(units=16, kernel_initializer='uniform', activation='relu'))
model.add(Dropout(0.25))
model.add(Dense(units=8, kernel_initializer='uniform', activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(units=1, kernel_initializer='uniform', activation='sigmoid'))

# compiling and fitting (reconstructed; the journal jumps straight to the
# training log, so the split sizes and settings are assumed)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
history = model.fit(X_train, y_train, batch_size=32, epochs=150,
                    callbacks=[early_stopping], validation_split=0.2)
Output:
Epoch 1/150
2551/2551 [==============================] - 5s 2ms/step - loss: 0.5967 -
accuracy: 0.7805 - val_loss: 0.3964 - val_accuracy: 0.7860
Epoch 2/150
2551/2551 [==============================] - 4s 2ms/step - loss: 0.4413 -
accuracy: 0.7919 - val_loss: 0.3860 - val_accuracy: 0.8388
Epoch 3/150
2551/2551 [==============================] - 4s 2ms/step - loss: 0.4290 -
accuracy: 0.8257 - val_loss: 0.3761 - val_accuracy: 0.8400
Epoch 4/150
2551/2551 [==============================] - 4s 2ms/step - loss: 0.4174 -
accuracy: 0.8295 - val_loss: 0.3712 - val_accuracy: 0.8421
Epoch 5/150
2551/2551 [==============================] - 4s 2ms/step - loss: 0.4137 -
accuracy: 0.8327 - val_loss: 0.3693 - val_accuracy: 0.8436
Epoch 6/150
2551/2551 [==============================] - 4s 2ms/step - loss: 0.4091 -
accuracy: 0.8338 - val_loss: 0.3669 - val_accuracy: 0.8443
Epoch 7/150
2551/2551 [==============================] - 4s 2ms/step - loss: 0.4082 -
accuracy: 0.8348 - val_loss: 0.3665 - val_accuracy: 0.8441
Epoch 8/150
2551/2551 [==============================] - 4s 2ms/step - loss: 0.4049 -
accuracy: 0.8354 - val_loss: 0.3650 - val_accuracy: 0.8439
Epoch 9/150
2551/2551 [==============================] - 4s 2ms/step - loss: 0.4020 -
accuracy: 0.8357 - val_loss: 0.3642 - val_accuracy: 0.8441
Epoch 10/150
2551/2551 [==============================] - 4s 2ms/step - loss: 0.3977 -
accuracy: 0.8363 - val_loss: 0.3635 - val_accuracy: 0.8445
Epoch 11/150
2551/2551 [==============================] - 4s 2ms/step - loss: 0.3984 -
accuracy: 0.8353 - val_loss: 0.3615 - val_accuracy: 0.8445
Epoch 12/150
2551/2551 [==============================] - 4s 2ms/step - loss: 0.3953 -
accuracy: 0.8368 - val_loss: 0.3618 - val_accuracy: 0.8443
Epoch 13/150
2551/2551 [==============================] - 4s 2ms/step - loss: 0.3975 -
accuracy: 0.8340 - val_loss: 0.3608 - val_accuracy: 0.8444
Epoch 14/150
2551/2551 [==============================] - 4s 2ms/step - loss: 0.3908 -
accuracy: 0.8373 - val_loss: 0.3597 - val_accuracy: 0.8449
Epoch 15/150
2551/2551 [==============================] - 4s 2ms/step - loss: 0.3859 -
accuracy: 0.8383 - val_loss: 0.3597 - val_accuracy: 0.8445
Epoch 16/150
2551/2551 [==============================] - 4s 2ms/step - loss: 0.3899 -
accuracy: 0.8355 - val_loss: 0.3593 - val_accuracy: 0.8433
Epoch 17/150
2551/2551 [==============================] - 4s 2ms/step - loss: 0.3889 -
accuracy: 0.8364 - val_loss: 0.3581 - val_accuracy: 0.8441
Epoch 18/150
2551/2551 [==============================] - 4s 2ms/step - loss: 0.3924 -
accuracy: 0.8336 - val_loss: 0.3580 - val_accuracy: 0.8438
Epoch 19/150
2551/2551 [==============================] - 4s 2ms/step - loss: 0.3886 -
accuracy: 0.8361 - val_loss: 0.3582 - val_accuracy: 0.8431
Epoch 20/150
2551/2551 [==============================] - 4s 2ms/step - loss: 0.3860 -
accuracy: 0.8352 - val_loss: 0.3578 - val_accuracy: 0.8421
# Plotting training history
history_df = pd.DataFrame(history.history)
plt.plot(history_df.loc[:, ['accuracy']], "#BDE2E2", label='Training accuracy')
plt.plot(history_df.loc[:, ['val_accuracy']], "#C2C4E2", label='Validation accuracy')
plt.legend()
plt.show()

# test-set evaluation (reconstructed; threshold 0.5 assumed)
from sklearn.metrics import classification_report
y_pred = (model.predict(X_test) > 0.5).astype("int32")
print(classification_report(y_test, y_pred))
Q22) Write a Python program to prepare data to be given to a convolutional neural network (CNN) and create an image classifier. Use the cat and dog training and test datasets.
Ans: -
# (fragments of the data-preparation functions; cv2, numpy and the
# shuffle helper are assumed imported)
# loading the image from the path and then converting it into
# grayscale for an easier covnet problem
img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)

# final step - forming the training data list with numpy arrays of the images
training_data.append([np.array(img), np.array(label)])

# shuffling of the training data to preserve the random state of our data
shuffle(training_data)
shuffle(testing_data)
np.save('test_data.npy', testing_data)
return testing_data
'''Running the training and the testing in the dataset for our model'''
train_data = create_train_data()
test_data = process_test_data()
# train_data = np.load('train_data.npy')
# test_data = np.load('test_data.npy')
'''Creating the neural network using tensorflow'''
# Importing the required libraries
import tflearn
from tflearn.layers.conv import conv_2d, max_pool_2d
from tflearn.layers.core import input_data, dropout, fully_connected
from tflearn.layers.estimator import regression
import tensorflow as tf
tf.reset_default_graph()
convnet = input_data(shape =[None, IMG_SIZE, IMG_SIZE, 1], name ='input')
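The convolutional layers between the input layer and the dropout line below are not reproduced in the journal; a sketch of the usual tflearn stack for this exercise (filter counts and sizes assumed):

# alternating convolution and max-pooling layers (sizes assumed)
convnet = conv_2d(convnet, 32, 5, activation='relu')
convnet = max_pool_2d(convnet, 5)
convnet = conv_2d(convnet, 64, 5, activation='relu')
convnet = max_pool_2d(convnet, 5)
convnet = fully_connected(convnet, 1024, activation='relu')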
convnet = dropout(convnet, 0.8)

# visualising predictions for a few test images (fragment)
fig = plt.figure()
img_num = data[1]
img_data = data[0]
y = fig.add_subplot(4, 5, num + 1)
orig = img_data
data = img_data.reshape(IMG_SIZE, IMG_SIZE, 1)
model_out = model.predict([data])[0]
y.axes.get_xaxis().set_visible(False)
y.axes.get_yaxis().set_visible(False)
plt.show()
Q23) Write a Python program to implement an RNN by building a character-level prediction RNN and training it on the text of "Harry Potter and the Philosopher's Stone".
Ans: -
import numpy as np
import matplotlib.pyplot as plt
class ReccurentNN:
    def __init__(self, char_to_idx, idx_to_char, vocab, h_size=75,
                 seq_len=20, clip_value=5, epochs=50, learning_rate=1e-2):
        self.n_h = h_size
        self.seq_len = seq_len          # number of characters in each batch/time steps
        self.clip_value = clip_value    # maximum allowed value for the gradients
        self.epochs = epochs
        self.learning_rate = learning_rate
        self.char_to_idx = char_to_idx  # dictionary that maps characters to an index
        self.idx_to_char = idx_to_char  # dictionary that maps indices to characters
        self.vocab = vocab              # number of unique characters in the training text
        # smoothing out loss as batch SGD is noisy
        self.smooth_loss = -np.log(1.0 / self.vocab) * self.seq_len
        # initialize parameters
        self.params = {}
    # (fragment of the batch-preparation method: one-hot encodes the
    # characters of each input and target batch)
        X_batch = []
        y_batch = []
        for i in X_batch_encoded:
            one_hot_char = np.zeros((1, self.vocab))
            one_hot_char[0][i] = 1
            X_batch.append(one_hot_char)
        for j in y_batch_encoded:
            one_hot_char = np.zeros((1, self.vocab))
            one_hot_char[0][j] = 1
            y_batch.append(one_hot_char)
        return X_batch, y_batch
    # (end of the forward-pass method)
        self.ho = h[t]
        return y_pred, h

    def _backward_pass(self, X, y, y_pred, h):
        dh_next = np.zeros_like(h[0])
        for t in reversed(range(self.seq_len)):
            dy = np.copy(y_pred[t])
            dy[0][np.argmax(y[t])] -= 1   # predicted y - actual y
    # (fragment of the test/sampling method)
        x = np.zeros((1, self.vocab))
        x[0][start_index] = 1
        for i in range(test_size):
            # forward propagation
            h = np.tanh(np.dot(x, self.params["W_xh"]) + np.dot(self.h0, self.params["W_hh"]) + self.params["b_h"])
            y_pred = self._softmax(np.dot(h, self.params["W_hy"]) + self.params["b_y"])
            # find the char with the index and concat to the output string
            char = self.idx_to_char[index]
            res += char
        return res
    def train(self, X):
        J = []
        num_batches = len(X) // self.seq_len
        X_trimmed = X[:num_batches * self.seq_len]  # trim end of the input text so that we have full sequences
        X_encoded = self._encode_text(X_trimmed)    # transform words to indices to enable processing
        for i in range(self.epochs):
            for j in range(0, len(X_encoded) - self.seq_len, self.seq_len):
                X_batch, y_batch = self._prepare_batches(X_encoded, j)
                y_pred, h = self._forward_pass(X_batch)
                loss = 0
                for t in range(self.seq_len):
                    loss += -np.log(y_pred[t][0, np.argmax(y_batch[t])])
                self.smooth_loss = self.smooth_loss * 0.999 + loss * 0.001
                J.append(self.smooth_loss)
                self._backward_pass(X_batch, y_batch, y_pred, h)
                self._update()
            print('Epoch:', i + 1, "\tLoss:", loss, "")
        return J, self.params
with open('Harry-Potter.txt') as f:
    text = f.read().lower()
# use only a part of the text to make the process faster
text = text[:20000]
# text = [char for char in text if char not in ["(", ")", "\"", "'", ".", "?", "!", ",", "-"]]
# text = [char for char in text if char not in ["(", ")", "\"", "'"]]

chars = set(text)
vocab = len(chars)
# print(f"Length of training text {len(text)}")
# print(f"Size of vocabulary {vocab}")

# index mappings (reconstructed; they feed the parameter dictionary below)
char_to_idx = {ch: i for i, ch in enumerate(chars)}
idx_to_char = {i: ch for i, ch in enumerate(chars)}

parameter_dict = {
    'char_to_idx': char_to_idx,
    'idx_to_char': idx_to_char,
    'vocab': vocab,
    'h_size': 75,
    'seq_len': 20,     # keep small to avoid diminishing/exploding gradients
    'clip_value': 5,
    'epochs': 50,
    'learning_rate': 1e-2,
}
model = ReccurentNN(**parameter_dict)
loss, params = model.train(text)
plt.figure(figsize=(12, 8))
plt.plot([i for i in range(len(loss))], loss)
plt.ylabel("Loss")
plt.xlabel("Epochs")
plt.show()
print(model.test(50,10))
OUTPUT:
Epoch: 1 Loss: 56.938160313575075
Epoch: 2 Loss: 49.479841032771944
Epoch: 3 Loss: 44.287300754487774
Epoch: 4 Loss: 42.75894603770088
Epoch: 5 Loss: 40.962449282519785
Epoch: 6 Loss: 41.06907316142755
Epoch: 7 Loss: 39.77795494997328
Epoch: 8 Loss: 41.059521063295485
Epoch: 9 Loss: 39.848893648177594
Epoch: 10 Loss: 40.42097045126549
Epoch: 11 Loss: 39.183043247471126
Epoch: 12 Loss: 40.09713939411275
Epoch: 13 Loss: 38.786694845855145
Epoch: 14 Loss: 39.41259563289025
Epoch: 15 Loss: 38.87094988626352
Epoch: 16 Loss: 38.80896936130275
Epoch: 17 Loss: 38.65301294936609
Epoch: 18 Loss: 38.2922486206415
Epoch: 19 Loss: 38.120326247610286
Epoch: 20 Loss: 37.94743442371039
Epoch: 21 Loss: 37.781826419304245
Epoch: 22 Loss: 38.02242197941186
Epoch: 23 Loss: 37.34639374983505
Epoch: 24 Loss: 37.383830387022115
Epoch: 25 Loss: 36.863261576664286
Epoch: 26 Loss: 36.81717706027801
Epoch: 27 Loss: 35.98781618662626
Epoch: 28 Loss: 34.883143187020806
Epoch: 29 Loss: 35.74233839750379
Epoch: 30 Loss: 34.17457373354039
Epoch: 31 Loss: 34.3659838303625
Epoch: 32 Loss: 34.6155982440106
Epoch: 33 Loss: 33.428021716569035
Epoch: 34 Loss: 33.06226727751935
Epoch: 35 Loss: 33.23334401686566
Epoch: 36 Loss: 32.9818416477839
Epoch: 37 Loss: 33.155764725505655
Epoch: 38 Loss: 32.937205806520474
Epoch: 39 Loss: 32.93063638107538
Epoch: 40 Loss: 32.943368437981256
Epoch: 41 Loss: 32.92520056534523
Epoch: 42 Loss: 32.96074563399301
Epoch: 43 Loss: 32.974579784369666
Epoch: 44 Loss: 32.86483014312194
Epoch: 45 Loss: 33.10532379921245
Epoch: 46 Loss: 32.89950584889016
Epoch: 47 Loss: 33.11303116056217
Epoch: 48 Loss: 32.731237824441756
Epoch: 49 Loss: 32.742918023080314
Epoch: 50 Loss: 32.421869906086144
Q24) Write a Python program to implement a GAN, to create a curve resembling a sine wave. The Python library PyTorch must be used, with a seeded random generator.
Ans: -
import math
import torch
from torch import nn
import matplotlib.pyplot as plt

# seeding the random generator for reproducibility (seed value assumed)
torch.manual_seed(111)
class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(2, 256),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(64, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        output = self.model(x)
        return output
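The Generator class is not reproduced in the journal; a sketch matching the two-dimensional sine-wave setup (hidden-layer sizes assumed):

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        # maps 2D latent points to 2D points on the curve
        self.model = nn.Sequential(
            nn.Linear(2, 16),
            nn.ReLU(),
            nn.Linear(16, 32),
            nn.ReLU(),
            nn.Linear(32, 2),
        )

    def forward(self, x):
        output = self.model(x)
        return output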
#instantiate a Discriminator object
discriminator = Discriminator()
generator = Generator()
# Training the discriminator (fragment of the epoch/batch training loop;
# the generator training step is not reproduced in the journal)
discriminator.zero_grad()
output_discriminator = discriminator(all_samples)
loss_discriminator = loss_function(output_discriminator, all_samples_labels)
loss_discriminator.backward()
optimizer_discriminator.step()

# Show loss
if epoch % 10 == 0 and n == batch_size - 1:
    print(f"Epoch: {epoch} Loss D.: {loss_discriminator}")
    print(f"Epoch: {epoch} Loss G.: {loss_generator}")