
Name: Abivirshan Suresh

Reg. No.: 39110009

Colab Notebook Link:


https://colab.research.google.com/drive/1OwEllqg0Vky2mxcfJTx_YvzesQx3yCjZ?usp=sharing

Cycle 2 : Twitter Dataset - Preprocessing


Date : 20.01.2022

Add the cleaned tweets (after removal of URLs and mentions) to a new column named 'new'.
Remove hyperlinks, Twitter marks and styles.
Tokenization - tokenize the given strings.
Stemming - reduce the size of the vocabulary.

from google.colab import files

uploaded=files.upload()

Choose Files tweet.csv


tweet.csv(text/csv) - 3135128 bytes, last modified: 1/21/2022 - 100% done
Saving tweet.csv to tweet.csv

import pandas as pd

import numpy as np

import re

import matplotlib.pyplot as plt

df=pd.read_csv("tweet.csv")

df=df.drop(labels=["id","label"],axis=1)

df.head()

tweet

0 @user when a father is dysfunctional and is s...

1 @user @user thanks for #lyft credit i can't us...

2 bihday your majesty

3 #model i love u take with u all the time in ...

4 factsguide: society now #motivation

def remove_pattern(input_txt, pattern):
    # collect every substring matching the pattern, then strip each occurrence
    r = re.findall(pattern, input_txt)
    for i in r:
        input_txt = re.sub(i, '', input_txt)
    return input_txt

# remove @mentions and store the cleaned tweets in a new column 'new'
df['new'] = np.vectorize(remove_pattern)(df['tweet'], r"@[\w]*")
df.head()

tweet new

0 @user when a father is dysfunctional and is s... when a father is dysfunctional and is so sel...

1 @user @user thanks for #lyft credit i can't us... thanks for #lyft credit i can't use cause th...

2 bihday your majesty bihday your majesty

3 #model i love u take with u all the time in ... #model i love u take with u all the time in ...

4 factsguide: society now #motivation factsguide: society now #motivation
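For reference, the same mention removal can be written without np.vectorize using pandas' vectorized string methods; a minimal sketch that assumes the same df and produces an equivalent 'new' column:

# equivalent, vectorized mention removal with pandas string methods
df['new'] = df['tweet'].str.replace(r"@[\w]*", "", regex=True)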

Removing links

df['new'] =df['new'].apply(lambda x:' '.join([w for w in x.split() if '.com' not in w]))
df.head()

tweet new

0 @user when a father is dysfunctional and is s... when a father is dysfunctional and is so selfi...

1 @user @user thanks for #lyft credit i can't us... thanks for #lyft credit i can't use cause they...

2 bihday your majesty bihday your majesty

3 #model i love u take with u all the time in ... #model i love u take with u all the time in ur...

4 factsguide: society now #motivation factsguide: society now #motivation
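Filtering on '.com' misses shorteners such as bit.ly and other link forms; a more general regex-based removal is sketched below (assumes the same 'new' column):

# remove full URLs (http/https and bare www.) with a single regex
df['new'] = df['new'].str.replace(r"https?://\S+|www\.\S+", "", regex=True)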

Remove hyperlinks, Twitter marks and styles

df['new'] = df['new'].str.replace("[^a-zA-Z]", " ", regex=True)  # keep letters only; regex=True is required on newer pandas

df.head()

tweet new

0 @user when a father is dysfunctional and is s... when a father is dysfunctional and is so selfi...

1 @user @user thanks for #lyft credit i can't us... thanks for lyft credit i can t use cause they...

2 bihday your majesty bihday your majesty

3 #model i love u take with u all the time in ... model i love u take with u all the time in ur...

4 factsguide: society now #motivation factsguide society now motivation

Tokenization

import nltk

nltk.download('punkt')


tf=pd.DataFrame()

from nltk.tokenize import word_tokenize

tf['tokens']=df['new'].apply(lambda x: word_tokenize(x.lower()))

tf.head()

[nltk_data] Downloading package punkt to /root/nltk_data...

[nltk_data] Unzipping tokenizers/punkt.zip.

tokens

0 [when, a, father, is, dysfunctional, and, is, ...

1 [thanks, for, lyft, credit, i, can, t, use, ca...

2 [bihday, your, majesty]

3 [model, i, love, u, take, with, u, all, the, t...

4 [factsguide, society, now, motivation]

Stemming

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

tf['tokens']=tf['tokens'].apply(lambda x: [stemmer.stem(i) for i in x])

tf.head()

tokens

0 [when, a, father, is, dysfunct, and, is, so, s...

1 [thank, for, lyft, credit, i, can, t, use, cau...

2 [bihday, your, majesti]

3 [model, i, love, u, take, with, u, all, the, t...

4 [factsguid, societi, now, motiv]

Lemmatization

import nltk

nltk.download('wordnet')

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

tf['tokens'] = tf['tokens'].apply(lambda x: [lemmatizer.lemmatize(i) for i in x])  # assign back so the lemmatized tokens are kept

tf.head()

[nltk_data] Downloading package wordnet to /root/nltk_data...

[nltk_data] Unzipping corpora/wordnet.zip.

tokens

0 [when, a, father, is, dysfunct, and, is, so, s...

1 [thank, for, lyft, credit, i, can, t, use, cau...

2 [bihday, your, majesti]

3 [model, i, love, u, take, with, u, all, the, t...

4 [factsguid, societi, now, motiv]
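Note that WordNetLemmatizer treats every token as a noun unless a part-of-speech tag is supplied, and lemmatizing already-stemmed tokens leaves it little to do, which is why the output above is unchanged. A small illustration:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('running'))           # 'running' - treated as a noun by default
print(lemmatizer.lemmatize('running', pos='v'))  # 'run' - verb POS enables the reduction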

Removal of Stopwords

import nltk

nltk.download('stopwords')

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

tf['tokens']=tf['tokens'].apply(lambda x: [ i for i in x if(i not in stop_words)])

tf.head()

[nltk_data] Downloading package stopwords to /root/nltk_data...

[nltk_data] Unzipping corpora/stopwords.zip.

tokens

0 [father, dysfunct, selfish, drag, hi, kid, hi,...

1 [thank, lyft, credit, use, caus, offer, wheelc...

2 [bihday, majesti]

3 [model, love, u, take, u, time, ur]

4 [factsguid, societi, motiv]

tf['tokens']=tf['tokens'].apply(lambda x: ' '.join([w for w in x if len(w)>3]))  # drop very short tokens and re-join into strings

tf.head()

tokens

0 father dysfunct selfish drag dysfunct

1 thank lyft credit caus offer wheelchair disapo...

2 bihday majesti

3 model love take time

4 factsguid societi motiv

tf=tf.replace('',np.NaN)  # tweets that became empty strings -> NaN

tf.dropna(axis=0,inplace=True)

tf.head()

tokens

0 father dysfunct selfish drag dysfunct

1 thank lyft credit caus offer wheelchair disapo...

2 bihday majesti

3 model love take time

4 factsguid societi motiv

from nltk.tokenize import word_tokenize

tokens=[]

for i in list(tf.loc[:,'tokens']):

  tokens+=word_tokenize(i)

print(tokens)

['father', 'dysfunct', 'selfish', 'drag', 'dysfunct', 'thank', 'lyft', 'credit', 'cau

mpw=[]

for i in set(tokens):

  if(tokens.count(i)>500):

    mpw.append(i)

    print(i,tokens.count(i))

thank 1580

posit 994

bihday 889

smile 930

feel 774

father 957

girl 651

work 803

time 1265

love 3245

healthi 611

need 661

come 642

great 537

take 740

happi 2106

follow 528

week 607

live 591

make 992

best 521

summer 591

friend 760

good 892

like 1249

bull 506

weekend 627

peopl 895

famili 623

life 1176

today 1105

want 779

wait 658

look 730

year 555

friday 540

beauti 663
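Calling tokens.count(i) inside the loop rescans the entire list once per distinct word; collections.Counter computes all counts in a single pass. A sketch using the same tokens list and threshold:

from collections import Counter

# single pass over the token list instead of one full scan per distinct word
counts = Counter(tokens)
mpw = [w for w in counts if counts[w] > 500]
for w in mpw:
    print(w, counts[w])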

plt.bar(mpw,[tokens.count(i) for i in mpw])

<BarContainer object of 37 artists>



Experiment 5: Implementation of Various Classification Algorithms

VISHWAJEET ANAND

40731127

Importing the necessary libraries

In [1]:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Reading the dataset

In [2]:

dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

Splitting the dataset into Train and Test

In [3]:

from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

K-Nearest Neighbor (KNN)
In [4]:

from sklearn.neighbors import KNeighborsClassifier


knn = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
knn.fit(X_train, y_train)
print('Accuracy of KNN classifier on training set: {:.2f}%'.format(knn.score(X_train, y_train)*100))
print('Accuracy of KNN classifier on test set: {:.2f}%'.format(knn.score(X_test, y_test)*100))
knn_score=knn.score(X_test, y_test)*100

Accuracy of KNN classifier on training set: 91.67%

Accuracy of KNN classifier on test set: 93.00%
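The choice n_neighbors=5 is a common default; a quick way to sanity-check k is cross-validation on the training set. A sketch using the same X_train and y_train:

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# mean 5-fold CV accuracy on the training set for a few odd k values
for k in [3, 5, 7, 9, 11]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_train, y_train, cv=5)
    print('k={}: mean CV accuracy {:.2f}%'.format(k, scores.mean()*100))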


Naive Bayes
In [5]:

from sklearn.naive_bayes import GaussianNB


nb = GaussianNB()
nb.fit(X_train, y_train)
print('Accuracy of Naive Bayes classifier on training set: {:.2f}%'.format(nb.score(X_train, y_train)*100))
print('Accuracy of Naive Bayes classifier on test set: {:.2f}%'.format(nb.score(X_test, y_test)*100))
nb_score=nb.score(X_test, y_test)*100

Accuracy of Naive Bayes classifier on training set: 88.33%

Accuracy of Naive Bayes classifier on test set: 90.00%

Decision Tree
In [6]:

from sklearn.tree import DecisionTreeClassifier


dt= DecisionTreeClassifier(criterion='entropy', random_state=0)
dt.fit(X_train,y_train)
print('Accuracy of Decision Tree classifier on training set: {:.2f}%'.format(dt.score(X_train, y_train)*100))
print('Accuracy of Decision Tree classifier on test set: {:.2f}%'.format(dt.score(X_test, y_test)*100))
dt_score=dt.score(X_test, y_test)*100

Accuracy of Decision Tree classifier on training set: 100.00%

Accuracy of Decision Tree classifier on test set: 91.00%

Logistic Regression
In [7]:

from sklearn.linear_model import LogisticRegression


logreg = LogisticRegression()
logreg.fit(X_train, y_train)
print('Accuracy of Logistic regression classifier on training set: {:.2f}%'.format(logreg.score(X_train, y_train)*100))
print('Accuracy of Logistic regression classifier on test set: {:.2f}%'.format(logreg.score(X_test, y_test)*100))
logreg_score=logreg.score(X_test, y_test)*100

Accuracy of Logistic regression classifier on training set: 80.33%

Accuracy of Logistic regression classifier on test set: 89.00%

Random Forest Classifier


In [8]:

from sklearn.ensemble import RandomForestClassifier


randclas = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
randclas.fit(X_train, y_train)
print('Accuracy of Random Forest classifier on training set: {:.2f}%'.format(randclas.score(X_train, y_train)*100))
print('Accuracy of Random Forest classifier on testing set: {:.2f}%'.format(randclas.score(X_test, y_test)*100))
randclas_score=randclas.score(X_test, y_test)*100

Accuracy of Random Forest classifier on training set: 98.00%

Accuracy of Random forest classifier on testing set: 92.00%

Support vector machine (SVM)


In [9]:

from sklearn.svm import SVC


svcclas = SVC(kernel = 'linear', random_state = 0)
svcclas.fit(X_train, y_train)
print('Accuracy of support vector machine on training set: {:.2f}%'.format(svcclas.score(X_train, y_train)*100))
print('Accuracy of support vector machine on testing set: {:.2f}%'.format(svcclas.score(X_test, y_test)*100))
svcclas_score=svcclas.score(X_test,y_test)*100

Accuracy of support vector machine on training set: 80.67%

Accuracy of support vector machine on testing set: 89.00%

Gradient Boosting for Regression


In [10]:

from sklearn.ensemble import GradientBoostingRegressor


# learning_rate assumed at scikit-learn's default of 0.1 (the trailing argument is truncated in the source)
gbr = GradientBoostingRegressor(n_estimators = 1000, max_depth = 3, min_samples_split = 5, learning_rate = 0.1)
gbr.fit(X_train, y_train)
# note: .score() on a regressor returns R^2, not classification accuracy
print(" Accuracy of gradient boosting on training set: {:.2f}%".format(gbr.score(X_train, y_train)*100))
print(" Accuracy of gradient boosting on testing set: {:.2f}%".format(gbr.score(X_test, y_test)*100))
gbr_score=gbr.score(X_test, y_test)*100
gbr_score=round(gbr_score,2)

Accuracy of gradient boosting on training set: 85.98%

Accuracy of gradient boosting on testing set: 71.67%
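Because GradientBoostingRegressor is a regressor, the scores above are R² rather than classification accuracy, so they are not directly comparable with the other models. For a like-for-like comparison, the classifier variant could be used instead; a sketch under the same train/test split (the hyperparameters here are illustrative):

from sklearn.ensemble import GradientBoostingClassifier

# classifier counterpart, so .score() reports accuracy like the other models
gbc = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=0)
gbc.fit(X_train, y_train)
print('Accuracy of gradient boosting classifier on testing set: {:.2f}%'.format(gbc.score(X_test, y_test)*100))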

Accuracy score of different algorithms on the test data


In [11]:

score=[knn_score, nb_score, dt_score, logreg_score, randclas_score, svcclas_score, gbr_score]

algorithms=['K-Nearest Neighbour', 'Naive Bayes', 'Decision Tree', 'Logistic Regression', 'Random Forest', 'Support Vector Machine', 'Gradient Boosting for Regression']
for i in range(len(algorithms)):
    print("The accuracy score achieved using {} is: {}%".format(algorithms[i], str(score[i])))

The accuracy score achieved using K-Nearest Neighbour is: 93.0%

The accuracy score achieved using Naive Bayes is: 90.0%

The accuracy score achieved using Decision Tree is: 91.0%

The accuracy score achieved using Logistic Regression is: 89.0%

The accuracy score achieved using Random Forest is: 92.0%

The accuracy score achieved using Support Vector Machine is: 89.0%

The accuracy score achieved using Gradient Boosting for Regression is: 71.67%

Graph visualization
In [12]:

sns.set(rc={'figure.figsize':(20,7)})
plt.xlabel('Algorithms')
plt.ylabel("Accuracy Scores (in %)")
sns.barplot(algorithms,score)

C:\Users\VISHWAJEET\anaconda3\lib\site-packages\seaborn\_decorators.py:36:
FutureWarning: Pass the following variables as keyword args: x, y. From vers
ion 0.12, the only valid positional argument will be `data`, and passing oth
er arguments without an explicit keyword will result in an error or misinter
pretation.

warnings.warn(

Out[12]:

<AxesSubplot:xlabel='Algorithms', ylabel='Accuracy Scores (in %)'>
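The FutureWarning above can be avoided by passing the variables as keyword arguments, which newer seaborn versions require; for example:

# keyword arguments avoid the seaborn FutureWarning shown above
sns.barplot(x=algorithms, y=score)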


Simple Linear Regression

VISHWAJEET ANAND
40731127

Importing the necessary libraries

In [1]:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Loading the dataset

In [2]:

dataset = pd.read_csv('Salary_Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

Splitting the dataset into train and test.

In [3]:

from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1/3, random_state =1)

Creating an object of LinearRegression

In [4]:

from sklearn.linear_model import LinearRegression


regressor = LinearRegression()
regressor.fit(X_train, y_train)

Out[4]:

LinearRegression()

Making predictions

In [5]:

y_pred = regressor.predict(X_test)

Visualizing the results with a graph.



In [13]:

plt.scatter(X_train, y_train, color = 'black')


plt.plot(X_train,regressor.predict(X_train),color = 'red')
plt.title('Salary vs Experience (Training set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()

In [7]:

from numpy import cov


from scipy.stats import pearsonr
covariance = cov(y_test,y_pred)
corr,_ = pearsonr(y_test,y_pred)
print('covariance:', covariance)
print('Pearsons correlation: %.3f' % corr)

covariance: [[5.48805612e+08 4.83847542e+08]

[4.83847542e+08 4.46019390e+08]]

Pearsons correlation: 0.978
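Correlation measures linear association rather than prediction error, so standard regression metrics are a useful complement; a sketch with the same y_test and y_pred:

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

print('MAE : %.2f' % mean_absolute_error(y_test, y_pred))
print('RMSE: %.2f' % np.sqrt(mean_squared_error(y_test, y_pred)))
print('R^2 : %.3f' % r2_score(y_test, y_pred))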




Cycle 2 : Data Pre-Processing: Building Good Training Sets

1. Describe the dataset
2. In the given dataset, count the missing values in each column
3. Replace the value 0 with NaN
4. Remove the rows with missing values
5. Impute the missing data with mean values
6. Split the dataset into training and testing sets (split in the ratio training:testing 80:20)

from google.colab import files


uploaded=files.upload()

Saving housing_price.csv to housing_price.csv

import pandas as pd
import numpy as np
df=pd.read_csv('housing_price.csv')


df.describe() #dataset description

Rooms Price Distance Postcode Bedroom2 Bathroom

count 34857.000000 2.724700e+04 34856.000000 34856.000000 26640.000000 26631.00000

mean 3.031012 1.050173e+06 11.184929 3116.062859 3.084647 1.62479

std 0.969933 6.414671e+05 6.788892 109.023903 0.980690 0.72421

min 1.000000 8.500000e+04 0.000000 3000.000000 0.000000 0.00000

25% 2.000000 6.350000e+05 6.400000 3051.000000 2.000000 1.00000

50% 3.000000 8.700000e+05 10.300000 3103.000000 3.000000 2.00000

75% 4.000000 1.295000e+06 14.000000 3156.000000 4.000000 2.00000

max 16.000000 1.120000e+07 48.100000 3978.000000 30.000000 12.00000

print(df.shape,df.size,df.ndim)

(34857, 21) 731997 2

df.isnull().sum() #null values count

Suburb 0
Address 0
Rooms 0
Type 0
Price 7610
Method 0
SellerG 0
Date 0
Distance 1
Postcode 1
Bedroom2 8217
Bathroom 8226
Car 8728
Landsize 11810
BuildingArea 21115
YearBuilt 19306
CouncilArea 3
Lattitude 7976
Longtitude 7976
Regionname 3
Propertycount 3
dtype: int64


df.columns

Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',


'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
'Longtitude', 'Regionname', 'Propertycount'],
dtype='object')

df.columns.size #number of columns

21

df.head(10)

  Suburb     Address            Rooms Type Price     Method SellerG Date      Distance
0 Abbotsford 68 Studley St      2     h    NaN       SS     Jellis  3/09/2016
1 Abbotsford 85 Turner St       2     h    1480000.0 S      Biggin  3/12/2016
2 Abbotsford 25 Bloomburg St    2     h    1035000.0 S      Biggin  4/02/2016
3 Abbotsford 18/659 Victoria St 3     u    NaN       VB     Rounds  4/02/2016
4 Abbotsford 5 Charles St       3     h    1465000.0 SP     Biggin  4/03/2017
5 Abbotsford 40 Federation La   3     h    850000.0  PI     Biggin  4/03/2017
6 Abbotsford 55a Park St        4     h    1600000.0 VB     Nelson  4/06/2016
7 Abbotsford 16 Maugie St       4     h    NaN       SN     Nelson  6/08/2016
8 Abbotsford 53 Turner St       2     h    NaN       S      Biggin  6/08/2016
9 Abbotsford 99 Turner St       2     h    NaN       S      Collins 6/08/2016


df=df.replace(0,np.NaN) #replace 0 with NaN

df.isnull().sum()

Suburb 0
Address 0
Rooms 0
Type 0
Price 7610
Method 0
SellerG 0
Date 0
Distance 78
Postcode 1
Bedroom2 8234
Bathroom 8272
Car 10359
Landsize 14247
BuildingArea 21191
YearBuilt 19306
CouncilArea 3
Lattitude 7976
Longtitude 7976
Regionname 3
Propertycount 3
dtype: int64


df.head(10)

  Suburb     Address            Rooms Type Price     Method SellerG Date      Distance
0 Abbotsford 68 Studley St      2     h    NaN       SS     Jellis  3/09/2016
1 Abbotsford 85 Turner St       2     h    1480000.0 S      Biggin  3/12/2016
2 Abbotsford 25 Bloomburg St    2     h    1035000.0 S      Biggin  4/02/2016
3 Abbotsford 18/659 Victoria St 3     u    NaN       VB     Rounds  4/02/2016
4 Abbotsford 5 Charles St       3     h    1465000.0 SP     Biggin  4/03/2017
5 Abbotsford 40 Federation La   3     h    850000.0  PI     Biggin  4/03/2017
6 Abbotsford 55a Park St        4     h    1600000.0 VB     Nelson  4/06/2016
7 Abbotsford 16 Maugie St       4     h    NaN       SN     Nelson  6/08/2016
8 Abbotsford 53 Turner St       2     h    NaN       S      Biggin  6/08/2016
9 Abbotsford 99 Turner St       2     h    NaN       S      Collins 6/08/2016

df=df.dropna(thresh=20)  # keep rows with at least 20 non-NaN values, i.e. drop rows missing more than one value


df.isnull().sum()

Suburb 0
Address 0
Rooms 0
Type 0
Price 2185
Method 0
SellerG 0
Date 0
Distance 0
Postcode 0
Bedroom2 4
Bathroom 0
Car 630
Landsize 2091
BuildingArea 1255
YearBuilt 284
CouncilArea 0
Lattitude 0
Longtitude 0
Regionname 0
Propertycount 0
dtype: int64

df=df.fillna(value=df.loc[:,df.columns].mean()) #filling mean values


df.isnull().sum()  # all remaining NaNs were numeric and are now imputed (the object columns, e.g. CouncilArea, had none left)

Suburb 0
Address 0
Rooms 0
Type 0
Price 0
Method 0
SellerG 0
Date 0
Distance 0
Postcode 0
Bedroom2 0
Bathroom 0
Car 0
Landsize 0
BuildingArea 0
YearBuilt 0
CouncilArea 0
Lattitude 0
Longtitude 0
Regionname 0
Propertycount 0
dtype: int64
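The same mean imputation can also be expressed with scikit-learn's SimpleImputer, which is convenient when the fitted column means must be reused on unseen data; a sketch for the numeric columns (names follow the df above):

from sklearn.impute import SimpleImputer
import numpy as np

# learn the column means on the numeric columns and reuse them on any later data
num_cols = df.select_dtypes(include=np.number).columns
imputer = SimpleImputer(strategy='mean')
df[num_cols] = imputer.fit_transform(df[num_cols])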

df=df.dropna()  # safety net: drop any rows still containing NaN


df.isnull().sum()

Suburb 0
Address 0
Rooms 0
Type 0
Price 0
Method 0
SellerG 0
Date 0
Distance 0
Postcode 0
Bedroom2 0
Bathroom 0
Car 0
Landsize 0
BuildingArea 0
YearBuilt 0
CouncilArea 0
Lattitude 0
Longtitude 0
Regionname 0
Propertycount 0
dtype: int64

df.dtypes.value_counts()

float64 12
object 8
int64 1
dtype: int64

print(df.shape,df.size,df.ndim)

(13772, 21) 289212 2

x=df.iloc[:,:-1]
y=df.iloc[:,-1]


      Suburb     Address          Rooms Type Price        Method SellerG Date
2     Abbotsford 25 Bloomburg St  2     h    1.035000e+06 S      Biggin  4/02/2016
4     Abbotsford 5 Charles St     3     h    1.465000e+06 SP     Biggin  4/03/2017
6     Abbotsford 55a Park St      4     h    1.600000e+06 VB     Nelson  4/06/2016
7     Abbotsford 16 Maugie St     4     h    1.094868e+06 SN     Nelson  6/08/2016
11    Abbotsford 124 Yarra St     3     h    1.876000e+06 S      Nelson  7/05/2016
...   ...        ...              ...   ...  ...          ...    ...     ...
34849 Wollert    35 Kingscote Wy  3     h    5.700000e+05 SP     RW      24/02/2018
34850 Wollert    15 Rockgarden Wy 3     h    1.094868e+06 SP     LJ      24/02/2018
34853 Yarraville 29A Murray St    2     h    8.880000e+05 SP     Sweeney 24/02/2018
34854 Yarraville 147A Severn St   2     t    7.050000e+05 S      Jas     24/02/2018
34856 Yarraville 3 Tarrengower St 2     h    1.020000e+06 PI     RW      24/02/2018

13772 rows × 20 columns


2 4019.0
4 4019.0
6 4019.0
7 4019.0
11 4019.0
...
34849 2940.0
34850 2940.0
34853 6543.0
34854 6543.0
34856 6543.0
Name: Propertycount, Length: 13772, dtype: float64

from sklearn.model_selection import train_test_split


x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, test_size=0.2)  # test_size truncated in the source; 0.2 matches the 80:20 split
print("Train size :",(len(x_train)/len(x))*100)
print("Test size :",(len(x_test)/len(x))*100)

Train size : 79.99564333430148


Test size : 20.00435666569852

Colab Notebook Link [ Data Pre-Processing: Building Good Training Sets ]:
https://colab.research.google.com/drive/1iHoMz00HMxm-PJfBnTy-Bxvao2hVsdbl?usp=sharing

NAME: S AbiVirshan
Reg No: 39110009
Year: 3rd, Sem: 6th

ML Lab : Confusion Matrix

import pandas as pd 

import numpy as np

from sklearn.preprocessing import StandardScaler

from sklearn.svm import SVC

from sklearn.metrics import confusion_matrix 

from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score 

import matplotlib.pyplot as plt 

from sklearn.datasets import load_iris 

from sklearn.model_selection import train_test_split 

social = pd.read_csv("Social_Network_Ads.csv") 
X=social.iloc[:,[2,3]].values
y=social.iloc[:,-1].values 
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.20) 
sc=StandardScaler() 
sc.fit(X_train) 
X_train=sc.transform(X_train) 
X_test=sc.transform(X_test) 
svc=SVC(kernel='linear',C=10.0,random_state=1)
svc.fit(X_train,y_train)  # fit returns SVC(C=10.0, kernel='linear', random_state=1)
y_pred=svc.predict(X_test)
conf_matrix=confusion_matrix(y_true=y_test,y_pred=y_pred) 
fig,ax=plt.subplots(figsize=(5,5))
ax.matshow(conf_matrix,cmap=plt.cm.Oranges,alpha=0.3) 


# annotate each cell of the matrix with its count
for i in range(conf_matrix.shape[0]):
    for j in range(conf_matrix.shape[1]):
        ax.text(x=j, y=i, s=conf_matrix[i, j], va='center', size='xx-large')
plt.xlabel('Predictions',fontsize=18)
plt.ylabel('Actuals',fontsize=18)
plt.title('Confusion Matrix',fontsize=18)
plt.show()

print('Precision: %.3f'%precision_score(y_test,y_pred))
print('Recall: %.3f'%recall_score(y_test,y_pred))
print('Accuracy: %3f'%accuracy_score(y_test,y_pred))
print('f1_score: %.3f'%f1_score(y_test,y_pred))

Precision: 1.000

Recall: 0.645

Accuracy: 0.862500

f1_score: 0.784
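Recent scikit-learn releases (1.0+) also provide a one-call plot for this; a sketch assuming the same fitted svc and test split:

from sklearn.metrics import ConfusionMatrixDisplay

# draws the annotated confusion matrix directly from the fitted classifier
ConfusionMatrixDisplay.from_estimator(svc, X_test, y_test, cmap=plt.cm.Oranges)
plt.title('Confusion Matrix', fontsize=18)
plt.show()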

