
Name: Abivirshan Suresh

Reg. No.: 39110009

Colab Notebook Link:


https://colab.research.google.com/drive/1OwEllqg0Vky2mxcfJTx_YvzesQx3yCjZ?usp=sharing

Cycle 2 : Twitter Dataset - Preprocessing


Date : 20.01.2022

Add the cleaned tweets (after removal of URLs and mentions) to a new column named 'new'.
Remove hyperlinks, Twitter marks and styles.
Tokenization - tokenize the given strings.
Stemming - reduce the size of the vocabulary.

from google.colab import files

uploaded=files.upload()

Choose Files tweet.csv


tweet.csv(text/csv) - 3135128 bytes, last modified: 1/21/2022 - 100% done
Saving tweet.csv to tweet.csv

import pandas as pd

import numpy as np

import re

import matplotlib.pyplot as plt

df=pd.read_csv("tweet.csv")

df=df.drop(labels=["id","label"],axis=1)

df.head()

tweet

0 @user when a father is dysfunctional and is s...

1 @user @user thanks for #lyft credit i can't us...

2 bihday your majesty

3 #model i love u take with u all the time in ...

4 factsguide: society now #motivation

def remove_pattern(input_txt, pattern):
    # collect every substring matching the pattern, then strip each occurrence
    r = re.findall(pattern, input_txt)
    for i in r:
        input_txt = re.sub(i, '', input_txt)
    return input_txt

# remove @mentions and store the cleaned tweets in a new column 'new'
df['new'] = np.vectorize(remove_pattern)(df['tweet'], r"@[\w]*")
df.head()

tweet new

0 @user when a father is dysfunctional and is s... when a father is dysfunctional and is so sel...

1 @user @user thanks for #lyft credit i can't us... thanks for #lyft credit i can't use cause th...

2 bihday your majesty bihday your majesty

3 #model i love u take with u all the time in ... #model i love u take with u all the time in ...

4 factsguide: society now #motivation factsguide: society now #motivation
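For reference, the same mention removal can be written without np.vectorize using pandas' vectorized string methods; a minimal sketch that assumes the same df and produces an equivalent 'new' column:

# equivalent, vectorized mention removal with pandas string methods
df['new'] = df['tweet'].str.replace(r"@[\w]*", "", regex=True)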

Removing links

df['new'] =df['new'].apply(lambda x:' '.join([w for w in x.split() if '.com' not in w]))
df.head()

tweet new

0 @user when a father is dysfunctional and is s... when a father is dysfunctional and is so selfi...

1 @user @user thanks for #lyft credit i can't us... thanks for #lyft credit i can't use cause they...

2 bihday your majesty bihday your majesty

3 #model i love u take with u all the time in ... #model i love u take with u all the time in ur...

4 factsguide: society now #motivation factsguide: society now #motivation
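Filtering on '.com' misses shorteners such as bit.ly and other link forms; a more general regex-based removal is sketched below (assumes the same 'new' column):

# remove full URLs (http/https and bare www.) with a single regex
df['new'] = df['new'].str.replace(r"https?://\S+|www\.\S+", "", regex=True)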

Remove hyperlinks, Twitter marks and styles

df['new'] = df['new'].str.replace("[^a-zA-Z]", " ", regex=True)  # keep letters only; regex=True is required on newer pandas

df.head()

tweet new

0 @user when a father is dysfunctional and is s... when a father is dysfunctional and is so selfi...

1 @user @user thanks for #lyft credit i can't us... thanks for lyft credit i can t use cause they...

2 bihday your majesty bihday your majesty

3 #model i love u take with u all the time in ... model i love u take with u all the time in ur...

4 factsguide: society now #motivation factsguide society now motivation

Tokenization

import nltk

nltk.download('punkt')


tf=pd.DataFrame()

from nltk.tokenize import word_tokenize

tf['tokens']=df['new'].apply(lambda x: word_tokenize(x.lower()))

tf.head()

[nltk_data] Downloading package punkt to /root/nltk_data...

[nltk_data] Unzipping tokenizers/punkt.zip.

tokens

0 [when, a, father, is, dysfunctional, and, is, ...

1 [thanks, for, lyft, credit, i, can, t, use, ca...

2 [bihday, your, majesty]

3 [model, i, love, u, take, with, u, all, the, t...

4 [factsguide, society, now, motivation]

Stemming

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

tf['tokens']=tf['tokens'].apply(lambda x: [stemmer.stem(i) for i in x])

tf.head()

tokens

0 [when, a, father, is, dysfunct, and, is, so, s...

1 [thank, for, lyft, credit, i, can, t, use, cau...

2 [bihday, your, majesti]

3 [model, i, love, u, take, with, u, all, the, t...

4 [factsguid, societi, now, motiv]

Lemmatization

import nltk

nltk.download('wordnet')

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

tf['tokens'] = tf['tokens'].apply(lambda x: [lemmatizer.lemmatize(i) for i in x])  # assign back so the lemmatized tokens are kept

tf.head()

[nltk_data] Downloading package wordnet to /root/nltk_data...

[nltk_data] Unzipping corpora/wordnet.zip.

tokens

0 [when, a, father, is, dysfunct, and, is, so, s...

1 [thank, for, lyft, credit, i, can, t, use, cau...

2 [bihday, your, majesti]

3 [model, i, love, u, take, with, u, all, the, t...

4 [factsguid, societi, now, motiv]
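Note that WordNetLemmatizer treats every token as a noun unless a part-of-speech tag is supplied, and lemmatizing already-stemmed tokens leaves it little to do, which is why the output above is unchanged. A small illustration:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('running'))           # 'running' - treated as a noun by default
print(lemmatizer.lemmatize('running', pos='v'))  # 'run' - verb POS enables the reduction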

Removal of Stopwords

import nltk

nltk.download('stopwords')

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

tf['tokens']=tf['tokens'].apply(lambda x: [ i for i in x if(i not in stop_words)])

tf.head()

[nltk_data] Downloading package stopwords to /root/nltk_data...

[nltk_data] Unzipping corpora/stopwords.zip.

tokens

0 [father, dysfunct, selfish, drag, hi, kid, hi,...

1 [thank, lyft, credit, use, caus, offer, wheelc...

2 [bihday, majesti]

3 [model, love, u, take, u, time, ur]

4 [factsguid, societi, motiv]

tf['tokens']=tf['tokens'].apply(lambda x: ' '.join([w for w in x if len(w)>3]))  # drop very short tokens and re-join into strings

tf.head()

tokens

0 father dysfunct selfish drag dysfunct

1 thank lyft credit caus offer wheelchair disapo...

2 bihday majesti

3 model love take time

4 factsguid societi motiv

tf=tf.replace('',np.NaN)  # tweets that became empty strings -> NaN

tf.dropna(axis=0,inplace=True)

tf.head()

tokens

0 father dysfunct selfish drag dysfunct

1 thank lyft credit caus offer wheelchair disapo...

2 bihday majesti

3 model love take time

4 factsguid societi motiv

from nltk.tokenize import word_tokenize

tokens=[]

for i in list(tf.loc[:,'tokens']):

  tokens+=word_tokenize(i)

print(tokens)

['father', 'dysfunct', 'selfish', 'drag', 'dysfunct', 'thank', 'lyft', 'credit', 'cau

mpw=[]

for i in set(tokens):

  if(tokens.count(i)>500):

    mpw.append(i)

    print(i,tokens.count(i))

thank 1580

posit 994

bihday 889

smile 930

feel 774

father 957

girl 651

work 803

time 1265

love 3245

healthi 611

need 661

come 642

great 537

take 740

happi 2106

follow 528

week 607

live 591

make 992

best 521

summer 591

friend 760

good 892

like 1249

bull 506

weekend 627

peopl 895

famili 623

life 1176

today 1105

want 779

wait 658

look 730

year 555

friday 540

beauti 663
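Calling tokens.count(i) inside the loop rescans the entire list once per distinct word; collections.Counter computes all counts in a single pass. A sketch using the same tokens list and threshold:

from collections import Counter

# single pass over the token list instead of one full scan per distinct word
counts = Counter(tokens)
mpw = [w for w in counts if counts[w] > 500]
for w in mpw:
    print(w, counts[w])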

plt.bar(mpw,[tokens.count(i) for i in mpw])

<BarContainer object of 37 artists>



Experiment 5: Implementation of Various Classification Algorithms

VISHWAJEET ANAND

40731127

Importing the necessary libraries

In [1]:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Reading the dataset

In [2]:

dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

Splitting the dataset into Train and Test

In [3]:

from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

K-Nearest Neighbor (KNN)
In [4]:

from sklearn.neighbors import KNeighborsClassifier


knn = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
knn.fit(X_train, y_train)
print('Accuracy of KNN classifier on training set: {:.2f}%'.format(knn.score(X_train, y_train)*100))
print('Accuracy of KNN classifier on test set: {:.2f}%'.format(knn.score(X_test, y_test)*100))
knn_score=knn.score(X_test, y_test)*100

Accuracy of KNN classifier on training set: 91.67%

Accuracy of KNN classifier on test set: 93.00%
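The choice n_neighbors=5 is a common default; a quick way to sanity-check k is cross-validation on the training set. A sketch using the same X_train and y_train:

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# mean 5-fold CV accuracy on the training set for a few odd k values
for k in [3, 5, 7, 9, 11]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_train, y_train, cv=5)
    print('k={}: mean CV accuracy {:.2f}%'.format(k, scores.mean()*100))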


Naive Bayes
In [5]:

from sklearn.naive_bayes import GaussianNB


nb = GaussianNB()
nb.fit(X_train, y_train)
print('Accuracy of Naive Bayes classifier on training set: {:.2f}%'.format(nb.score(X_train, y_train)*100))
print('Accuracy of Naive Bayes classifier on test set: {:.2f}%'.format(nb.score(X_test, y_test)*100))
nb_score=nb.score(X_test, y_test)*100

Accuracy of Naive Bayes classifier on training set: 88.33%

Accuracy of Naive Bayes classifier on test set: 90.00%

Decision Tree
In [6]:

from sklearn.tree import DecisionTreeClassifier


dt= DecisionTreeClassifier(criterion='entropy', random_state=0)
dt.fit(X_train,y_train)
print('Accuracy of Decision Tree classifier on training set: {:.2f}%'.format(dt.score(X_train, y_train)*100))
print('Accuracy of Decision Tree classifier on test set: {:.2f}%'.format(dt.score(X_test, y_test)*100))
dt_score=dt.score(X_test, y_test)*100

Accuracy of Decision Tree classifier on training set: 100.00%

Accuracy of Decision Tree classifier on test set: 91.00%

Logistic Regression
In [7]:

from sklearn.linear_model import LogisticRegression


logreg = LogisticRegression()
logreg.fit(X_train, y_train)
print('Accuracy of Logistic regression classifier on training set: {:.2f}%'.format(logreg.score(X_train, y_train)*100))
print('Accuracy of Logistic regression classifier on test set: {:.2f}%'.format(logreg.score(X_test, y_test)*100))
logreg_score=logreg.score(X_test, y_test)*100

Accuracy of Logistic regression classifier on training set: 80.33%

Accuracy of Logistic regression classifier on test set: 89.00%

Random Forest Classifier


In [8]:

from sklearn.ensemble import RandomForestClassifier


randclas = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
randclas.fit(X_train, y_train)
print('Accuracy of Random Forest classifier on training set: {:.2f}%'.format(randclas.score(X_train, y_train)*100))
print('Accuracy of Random Forest classifier on testing set: {:.2f}%'.format(randclas.score(X_test, y_test)*100))
randclas_score=randclas.score(X_test, y_test)*100

Accuracy of Random Forest classifier on training set: 98.00%

Accuracy of Random forest classifier on testing set: 92.00%

Support vector machine (SVM)


In [9]:

from sklearn.svm import SVC


svcclas = SVC(kernel = 'linear', random_state = 0)
svcclas.fit(X_train, y_train)
print('Accuracy of support vector machine on training set: {:.2f}%'.format(svcclas.score(X_train, y_train)*100))
print('Accuracy of support vector machine on testing set: {:.2f}%'.format(svcclas.score(X_test, y_test)*100))
svcclas_score=svcclas.score(X_test,y_test)*100

Accuracy of support vector machine on training set: 80.67%

Accuracy of support vector machine on testing set: 89.00%

Gradient Boosting for Regression


In [10]:

from sklearn.ensemble import GradientBoostingRegressor


# learning_rate assumed at scikit-learn's default of 0.1 (the trailing argument is truncated in the source)
gbr = GradientBoostingRegressor(n_estimators = 1000, max_depth = 3, min_samples_split = 5, learning_rate = 0.1)
gbr.fit(X_train, y_train)
# note: .score() on a regressor returns R^2, not classification accuracy
print(" Accuracy of gradient boosting on training set: {:.2f}%".format(gbr.score(X_train, y_train)*100))
print(" Accuracy of gradient boosting on testing set: {:.2f}%".format(gbr.score(X_test, y_test)*100))
gbr_score=gbr.score(X_test, y_test)*100
gbr_score=round(gbr_score,2)

Accuracy of gradient boosting on training set: 85.98%

Accuracy of gradient boosting on testing set: 71.67%
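Because GradientBoostingRegressor is a regressor, the scores above are R² rather than classification accuracy, so they are not directly comparable with the other models. For a like-for-like comparison, the classifier variant could be used instead; a sketch under the same train/test split (the hyperparameters here are illustrative):

from sklearn.ensemble import GradientBoostingClassifier

# classifier counterpart, so .score() reports accuracy like the other models
gbc = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=0)
gbc.fit(X_train, y_train)
print('Accuracy of gradient boosting classifier on testing set: {:.2f}%'.format(gbc.score(X_test, y_test)*100))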

Accuracy score of different algorithms on the test data


In [11]:

score=[knn_score, nb_score, dt_score, logreg_score, randclas_score, svcclas_score, gbr_score]

algorithms=['K-Nearest Neighbour', 'Naive Bayes', 'Decision Tree', 'Logistic Regression', 'Random Forest', 'Support Vector Machine', 'Gradient Boosting for Regression']
for i in range(len(algorithms)):
    print("The accuracy score achieved using {} is: {}%".format(algorithms[i], str(score[i])))

The accuracy score achieved using K-Nearest Neighbour is: 93.0%

The accuracy score achieved using Naive Bayes is: 90.0%

The accuracy score achieved using Decision Tree is: 91.0%

The accuracy score achieved using Logistic Regression is: 89.0%

The accuracy score achieved using Random Forest is: 92.0%

The accuracy score achieved using Support Vector Machine is: 89.0%

The accuracy score achieved using Gradient Boosting for Regression is: 71.67%

Graph visualization
In [12]:

sns.set(rc={'figure.figsize':(20,7)})
plt.xlabel('Algorithms')
plt.ylabel("Accuracy Scores (in %)")
sns.barplot(algorithms,score)

C:\Users\VISHWAJEET\anaconda3\lib\site-packages\seaborn\_decorators.py:36:
FutureWarning: Pass the following variables as keyword args: x, y. From vers
ion 0.12, the only valid positional argument will be `data`, and passing oth
er arguments without an explicit keyword will result in an error or misinter
pretation.

warnings.warn(

Out[12]:

<AxesSubplot:xlabel='Algorithms', ylabel='Accuracy Scores (in %)'>
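The FutureWarning above can be avoided by passing the variables as keyword arguments, which newer seaborn versions require; for example:

# keyword arguments avoid the seaborn FutureWarning shown above
sns.barplot(x=algorithms, y=score)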


Simple Linear Regression

VISHWAJEET ANAND
40731127

Importing the necessary libraries

In [1]:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Loading the dataset

In [2]:

dataset = pd.read_csv('Salary_Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

Splitting the dataset into train and test.

In [3]:

from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1/3, random_state =1)

Creating an object of LinearRegression

In [4]:

from sklearn.linear_model import LinearRegression


regressor = LinearRegression()
regressor.fit(X_train, y_train)

Out[4]:

LinearRegression()

Making predictions

In [5]:

y_pred = regressor.predict(X_test)

Visualizing the results with a graph.



In [13]:

plt.scatter(X_train, y_train, color = 'black')


plt.plot(X_train,regressor.predict(X_train),color = 'red')
plt.title('Salary vs Experience (Training set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()

In [7]:

from numpy import cov


from scipy.stats import pearsonr
covariance = cov(y_test,y_pred)
corr,_ = pearsonr(y_test,y_pred)
print('covariance:', covariance)
print('Pearsons correlation: %.3f' % corr)

covariance: [[5.48805612e+08 4.83847542e+08]

[4.83847542e+08 4.46019390e+08]]

Pearsons correlation: 0.978
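Correlation measures linear association rather than prediction error, so standard regression metrics are a useful complement; a sketch with the same y_test and y_pred:

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

print('MAE : %.2f' % mean_absolute_error(y_test, y_pred))
print('RMSE: %.2f' % np.sqrt(mean_squared_error(y_test, y_pred)))
print('R^2 : %.3f' % r2_score(y_test, y_pred))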




Cycle 2 : Data Pre-Processing: Building Good Training Sets

1. Describe the dataset
2. In the given dataset, count the missing values in each column
3. Replace the value 0 with NaN
4. Remove the rows with missing values
5. Impute the missing data with mean values
6. Split the dataset into training and testing sets (split in the ratio training:testing 80:20)

from google.colab import files


uploaded=files.upload()

Saving housing_price.csv to housing_price.csv

import pandas as pd
import numpy as np
df=pd.read_csv('housing_price.csv')


df.describe() #dataset description

Rooms Price Distance Postcode Bedroom2 Bathroom

count 34857.000000 2.724700e+04 34856.000000 34856.000000 26640.000000 26631.00000

mean 3.031012 1.050173e+06 11.184929 3116.062859 3.084647 1.62479

std 0.969933 6.414671e+05 6.788892 109.023903 0.980690 0.72421

min 1.000000 8.500000e+04 0.000000 3000.000000 0.000000 0.00000

25% 2.000000 6.350000e+05 6.400000 3051.000000 2.000000 1.00000

50% 3.000000 8.700000e+05 10.300000 3103.000000 3.000000 2.00000

75% 4.000000 1.295000e+06 14.000000 3156.000000 4.000000 2.00000

max 16.000000 1.120000e+07 48.100000 3978.000000 30.000000 12.00000

print(df.shape,df.size,df.ndim)

(34857, 21) 731997 2

df.isnull().sum() #null values count

Suburb 0
Address 0
Rooms 0
Type 0
Price 7610
Method 0
SellerG 0
Date 0
Distance 1
Postcode 1
Bedroom2 8217
Bathroom 8226
Car 8728
Landsize 11810
BuildingArea 21115
YearBuilt 19306
CouncilArea 3
Lattitude 7976
Longtitude 7976
Regionname 3
Propertycount 3
dtype: int64


df.columns

Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',


'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
'Longtitude', 'Regionname', 'Propertycount'],
dtype='object')

df.columns.size #number of columns

21

df.head(10)

  Suburb     Address            Rooms Type Price     Method SellerG Date      Distance
0 Abbotsford 68 Studley St      2     h    NaN       SS     Jellis  3/09/2016
1 Abbotsford 85 Turner St       2     h    1480000.0 S      Biggin  3/12/2016
2 Abbotsford 25 Bloomburg St    2     h    1035000.0 S      Biggin  4/02/2016
3 Abbotsford 18/659 Victoria St 3     u    NaN       VB     Rounds  4/02/2016
4 Abbotsford 5 Charles St       3     h    1465000.0 SP     Biggin  4/03/2017
5 Abbotsford 40 Federation La   3     h    850000.0  PI     Biggin  4/03/2017
6 Abbotsford 55a Park St        4     h    1600000.0 VB     Nelson  4/06/2016
7 Abbotsford 16 Maugie St       4     h    NaN       SN     Nelson  6/08/2016
8 Abbotsford 53 Turner St       2     h    NaN       S      Biggin  6/08/2016
9 Abbotsford 99 Turner St       2     h    NaN       S      Collins 6/08/2016


df=df.replace(0,np.NaN) #replace 0 with NaN

df.isnull().sum()

Suburb 0
Address 0
Rooms 0
Type 0
Price 7610
Method 0
SellerG 0
Date 0
Distance 78
Postcode 1
Bedroom2 8234
Bathroom 8272
Car 10359
Landsize 14247
BuildingArea 21191
YearBuilt 19306
CouncilArea 3
Lattitude 7976
Longtitude 7976
Regionname 3
Propertycount 3
dtype: int64


df.head(10)

  Suburb     Address            Rooms Type Price     Method SellerG Date      Distance
0 Abbotsford 68 Studley St      2     h    NaN       SS     Jellis  3/09/2016
1 Abbotsford 85 Turner St       2     h    1480000.0 S      Biggin  3/12/2016
2 Abbotsford 25 Bloomburg St    2     h    1035000.0 S      Biggin  4/02/2016
3 Abbotsford 18/659 Victoria St 3     u    NaN       VB     Rounds  4/02/2016
4 Abbotsford 5 Charles St       3     h    1465000.0 SP     Biggin  4/03/2017
5 Abbotsford 40 Federation La   3     h    850000.0  PI     Biggin  4/03/2017
6 Abbotsford 55a Park St        4     h    1600000.0 VB     Nelson  4/06/2016
7 Abbotsford 16 Maugie St       4     h    NaN       SN     Nelson  6/08/2016
8 Abbotsford 53 Turner St       2     h    NaN       S      Biggin  6/08/2016
9 Abbotsford 99 Turner St       2     h    NaN       S      Collins 6/08/2016

df=df.dropna(thresh=20)  # keep rows with at least 20 non-NaN values, i.e. drop rows missing more than one value


df.isnull().sum()

Suburb 0
Address 0
Rooms 0
Type 0
Price 2185
Method 0
SellerG 0
Date 0
Distance 0
Postcode 0
Bedroom2 4
Bathroom 0
Car 630
Landsize 2091
BuildingArea 1255
YearBuilt 284
CouncilArea 0
Lattitude 0
Longtitude 0
Regionname 0
Propertycount 0
dtype: int64

df=df.fillna(value=df.loc[:,df.columns].mean()) #filling mean values


df.isnull().sum()  # all remaining NaNs were numeric and are now imputed (the object columns, e.g. CouncilArea, had none left)

Suburb 0
Address 0
Rooms 0
Type 0
Price 0
Method 0
SellerG 0
Date 0
Distance 0
Postcode 0
Bedroom2 0
Bathroom 0
Car 0
Landsize 0
BuildingArea 0
YearBuilt 0
CouncilArea 0
Lattitude 0
Longtitude 0
Regionname 0
Propertycount 0
dtype: int64
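The same mean imputation can also be expressed with scikit-learn's SimpleImputer, which is convenient when the fitted column means must be reused on unseen data; a sketch for the numeric columns (names follow the df above):

from sklearn.impute import SimpleImputer
import numpy as np

# learn the column means on the numeric columns and reuse them on any later data
num_cols = df.select_dtypes(include=np.number).columns
imputer = SimpleImputer(strategy='mean')
df[num_cols] = imputer.fit_transform(df[num_cols])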

df=df.dropna()  # safety net: drop any rows still containing NaN


df.isnull().sum()

Suburb 0
Address 0
Rooms 0
Type 0
Price 0
Method 0
SellerG 0
Date 0
Distance 0
Postcode 0
Bedroom2 0
Bathroom 0
Car 0
Landsize 0
BuildingArea 0
YearBuilt 0
CouncilArea 0
Lattitude 0
Longtitude 0
Regionname 0
Propertycount 0
dtype: int64

df.dtypes.value_counts()

float64 12
object 8
int64 1
dtype: int64

print(df.shape,df.size,df.ndim)

(13772, 21) 289212 2

x=df.iloc[:,:-1]
y=df.iloc[:,-1]


      Suburb     Address          Rooms Type Price        Method SellerG Date
2     Abbotsford 25 Bloomburg St  2     h    1.035000e+06 S      Biggin  4/02/2016
4     Abbotsford 5 Charles St     3     h    1.465000e+06 SP     Biggin  4/03/2017
6     Abbotsford 55a Park St      4     h    1.600000e+06 VB     Nelson  4/06/2016
7     Abbotsford 16 Maugie St     4     h    1.094868e+06 SN     Nelson  6/08/2016
11    Abbotsford 124 Yarra St     3     h    1.876000e+06 S      Nelson  7/05/2016
...   ...        ...              ...   ...  ...          ...    ...     ...
34849 Wollert    35 Kingscote Wy  3     h    5.700000e+05 SP     RW      24/02/2018
34850 Wollert    15 Rockgarden Wy 3     h    1.094868e+06 SP     LJ      24/02/2018
34853 Yarraville 29A Murray St    2     h    8.880000e+05 SP     Sweeney 24/02/2018
34854 Yarraville 147A Severn St   2     t    7.050000e+05 S      Jas     24/02/2018
34856 Yarraville 3 Tarrengower St 2     h    1.020000e+06 PI     RW      24/02/2018

13772 rows × 20 columns


2 4019.0
4 4019.0
6 4019.0
7 4019.0
11 4019.0
...
34849 2940.0
34850 2940.0
34853 6543.0
34854 6543.0
34856 6543.0
Name: Propertycount, Length: 13772, dtype: float64

from sklearn.model_selection import train_test_split


x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, test_size=0.2)  # test_size truncated in the source; 0.2 matches the 80:20 split
print("Train size :",(len(x_train)/len(x))*100)
print("Test size :",(len(x_test)/len(x))*100)

Train size : 79.99564333430148


Test size : 20.00435666569852

Colab Notebook Link [ Data Pre-Processing: Building Good Training Sets ]:
https://colab.research.google.com/drive/1iHoMz00HMxm-PJfBnTy-Bxvao2hVsdbl?usp=sharing

NAME: S AbiVirshan
Reg No: 39110009
Year: 3rd, Sem: 6th

ML Lab : Confusion Matrix

import pandas as pd 

import numpy as np

from sklearn.preprocessing import StandardScaler

from sklearn.svm import SVC

from sklearn.metrics import confusion_matrix 

from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score 

import matplotlib.pyplot as plt 

from sklearn.datasets import load_iris 

from sklearn.model_selection import train_test_split 

social = pd.read_csv("Social_Network_Ads.csv") 
X=social.iloc[:,[2,3]].values
y=social.iloc[:,-1].values 
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.20) 
sc=StandardScaler() 
sc.fit(X_train) 
X_train=sc.transform(X_train) 
X_test=sc.transform(X_test) 
svc=SVC(kernel='linear',C=10.0,random_state=1)
svc.fit(X_train,y_train)  # fit returns SVC(C=10.0, kernel='linear', random_state=1)
y_pred=svc.predict(X_test)
conf_matrix=confusion_matrix(y_true=y_test,y_pred=y_pred) 
fig,ax=plt.subplots(figsize=(5,5))
ax.matshow(conf_matrix,cmap=plt.cm.Oranges,alpha=0.3) 


# annotate each cell of the matrix with its count
for i in range(conf_matrix.shape[0]):
    for j in range(conf_matrix.shape[1]):
        ax.text(x=j, y=i, s=conf_matrix[i, j], va='center', size='xx-large')
plt.xlabel('Predictions',fontsize=18)
plt.ylabel('Actuals',fontsize=18)
plt.title('Confusion Matrix',fontsize=18)
plt.show()

print('Precision: %.3f'%precision_score(y_test,y_pred))
print('Recall: %.3f'%recall_score(y_test,y_pred))
print('Accuracy: %3f'%accuracy_score(y_test,y_pred))
print('f1_score: %.3f'%f1_score(y_test,y_pred))

Precision: 1.000

Recall: 0.645

Accuracy: 0.862500

f1_score: 0.784
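Recent scikit-learn releases (1.0+) also provide a one-call plot for this; a sketch assuming the same fitted svc and test split:

from sklearn.metrics import ConfusionMatrixDisplay

# draws the annotated confusion matrix directly from the fitted classifier
ConfusionMatrixDisplay.from_estimator(svc, X_test, y_test, cmap=plt.cm.Oranges)
plt.title('Confusion Matrix', fontsize=18)
plt.show()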

