0% found this document useful (0 votes)
14 views98 pages

Algorithm of Data Science Journal

Uploaded by

abhinandanpaul1
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
14 views98 pages

Algorithm of Data Science Journal

Uploaded by

abhinandanpaul1
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 98

1

UNIVERSITY OF MUMBAI
DEPARTMENT OF COMPUTER SCIENCE

M.Sc. Computer Science with Spl. in Data Science – Semester II


ALGORITHMS OD DATA SCIENCE
JOURNAL
2021-2022

Seat No.

MSc. Computer Science with specialization in DATA SCIENCE- SEM II ALGORITHMS OF DATA SCIENCE 2021-2022
2

UNIVERSITY OF MUMBAI
DEPARTMENT OF COMPUTER SCIENCE

CERTIFICATE
This is to certify that the work entered in this journal was done in the University
Department of Computer Science laboratory by
Mr./Ms. ARCHANA SUKUMARAN NAIR, Seat No.
for the course of M.Sc. Computer Science with Spl. in Data Science - Semester
II (CBCS) (Revised) during the academic year 2021-2022 in a satisfactory
manner.

Subject In-charge Head of Department

External Examiner

MSc. Computer Science with specialization in DATA SCIENCE- SEM II ALGORITHMS OF DATA SCIENCE 2021-2022
3

Index

Sr. no. Name of the practical Page No. Date Sign

1 HADOOP HDFS 4

2 WORD COUNT APPLICATION 10


IN MAPREDUCE(APACHE PIG)
3 KMEANS CLUSTERING 16
ALGORITHM
4 HIERARCHICAL CLUSTERING 27

5 KNN CLASSIFIER 36

6 KNN REGRESSOR 46

7 TIME SERIES 53

8 PREDICTONG AUTHORS OF 62
DISPUTED FEDERALIST PAPER
9 SENTIMENT ANALYSIS 72

10 REGRESSION ANALYSIS 84

MSc. Computer Science with specialization in DATA SCIENCE- SEM II ALGORITHMS OF DATA SCIENCE 2021-2022
4

NAME : ARCHANA NAIR


SUBJECT : Algorithm for Data Science
COURSE : M.Sc. Computer Science with Specialization in Data Science
PRACTICAL 1

1. Start Hadoop Namenode and Datanode

NameNode is the master node in Hadoop Distributed File System (HDFS)


that manages the file system metadata while the DataNode is a slave node
in Hadoop distributed file system that stores the actual data as instructed by
the NameNode.

MSc. Computer Science with specialization in DATA SCIENCE- SEM II ALGORITHMS OF DATA SCIENCE 2021-2022
5

2. Print your Hadoop version

Hadoop version command gives us the information about which version of hadoop
is currently installed in the system.

3. Print the nodes which are running

Command JPS shows which nodes are currently running for the hadoop file system.

4. Create a directory Hadoop-assign-1 in HDFS


hadoop fs mkdir command creates a directory in the hadoop file system at home
location.

MSc. Computer Science with specialization in DATA SCIENCE- SEM II ALGORITHMS OF DATA SCIENCE 2021-2022
6

5. Create an empty file my-hadoop-assign.txt

hadoop fs -touchz command creates an empty text file at location.

6. Create a file called “hare_story.txt” in C:\ with the


content copied from the story Hare And Tortoise Story -
Bedtimeshortstories

Hadoop fs -put command is used to store copy at desired location.

MSc. Computer Science with specialization in DATA SCIENCE- SEM II ALGORITHMS OF DATA SCIENCE 2021-2022
7

7. Append the content from the local file


“hare_story.txt” to “my-hadoop-assign.txt” in
HDFS.

Command appendToFile is used to add content of file which is in local system to the
hadoop file system.

8. Copy any local file to HDFS directly


In local system we can create a file and then it is copied to the hadoop file system.

9. View the content of HDFS with subfolders and


files.

fs -ls -R / command show all the directories and sub-directories of the hadoop file
system.

MSc. Computer Science with specialization in DATA SCIENCE- SEM II ALGORITHMS OF DATA SCIENCE 2021-2022
8

10. Rename the “my-hadoop-assign.txt” to “my-


homework.txt”
Fs -mv / command renames a file name of hadoop file system.

11. View the size of the files in HDFS in terms of


KB .

Fs -du -h / command shows each of the file sizes of hadoop file system.

12. Show the web UI with the files in storage.

Web UI of the hadoop file storage which is being accessed by http:/ip-address/9870

MSc. Computer Science with specialization in DATA SCIENCE- SEM II ALGORITHMS OF DATA SCIENCE 2021-2022
9

MSc. Computer Science with specialization in DATA SCIENCE- SEM II ALGORITHMS OF DATA SCIENCE 2021-2022
10

NAME : ARCHANA NAIR


SUBJECT : Algorithm for Data Science
COURSE : M.Sc. Computer Science with Specialization in
Data Science
PRACTICAL 2
Use Apache PIG to create word count . Write the commands and display
output at each step.

A. Start hadoop dfs and yarn :


hadoop start-dfs.cmd

hadoop start-yarn.cmd

MSc. Computer Science with specialization in DATA SCIENCE- SEM II ALGORITHMS OF DATA SCIENCE 2021-2022
11

B. Copy local file to hdfs :


hadoop fs -put /C:/Users/verma/Downloads/hare.txt /

MSc. Computer Science with specialization in DATA SCIENCE- SEM II ALGORITHMS OF DATA SCIENCE 2021-2022
12

C. Loading lines of a text file from hdfs and storing in a variable


lines = load 'hdfs://localhost:9000/hare.txt' using TextLoader as
(line:chararray);
DUMP lines;

D. Seperate each words of each line into tokens ( tuples )


MSc. Computer Science with specialization in DATA SCIENCE- SEM II ALGORITHMS OF DATA SCIENCE 2021-2022
13

words = foreach lines generate flatten(TOKENIZE(line)) as word;


DUMP words;

E. Categorising repeated words in the same group


groups = group words by word;
DUMP groups;

MSc. Computer Science with specialization in DATA SCIENCE- SEM II ALGORITHMS OF DATA SCIENCE 2021-2022
14

F. Count the number of words of each group


count = foreach groups generate group, (COUNT(words));
DUMP count;

MSc. Computer Science with specialization in DATA SCIENCE- SEM II ALGORITHMS OF DATA SCIENCE 2021-2022
15

MSc. Computer Science with specialization in DATA SCIENCE- SEM II ALGORITHMS OF DATA SCIENCE 2021-2022
16

ARCHANA NAIR
M.Sc. Computer Science (With Specialization in Data Science) University of Mumbai PRACTICAL 3 :- K-Means
Clustering

In [2]:

import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.model_selection import GridSearchCV
import matplotlib.pyplot as plt

In [22]:

data = pd.read_csv("C:/Users/archa/Downloads/Mall_Customers.csv")
data.head()

Out[22]:

CustomerID Genre Age Annual Income (k$) Spending Score (1-100)

0 1 Male 19 15 39

1 2 Male 21 15 81

2 3 Female 20 16 6

3 4 Female 23 16 77

4 5 Female 31 17 40

In [4]:

data = data.drop(columns = ['CustomerID'])


data.head()

Out[4]:

Genre Age Annual Income (k$) Spending Score (1-100)

0 Male 19 15 39

1 Male 21 15 81

2 Female 20 16 6

3 Female 23 16 77

4 Female 31 17 40
17

In [23]:

data.dtypes

Out[23]:

CustomerID int64
Genre object
Age int64
Annual Income (k$) int64
Spending Score (1-100) int64
dtype: object

In [24]:

#analyse missing values


data.isna().sum()

Out[24]:

CustomerID 0
Genre 0
Age 0
Annual Income (k$) 0
Spending Score (1-100) 0
dtype: int64

In [ ]:
18

In [5]:

sns.pairplot(data)

Out[5]:

<seaborn.axisgrid.PairGrid at 0x20172d70730>
19

In [6]:

data.columns

Out[6]:

Index(['Genre', 'Age', 'Annual Income (k$)', 'Spending Score (1-100)'], dtyp


e='object')
20

In [7]:

col = ['Genre', 'Age', 'Annual Income (k$)', 'Spending Score (1-100)']


for i in col:
plt.figure(figsize =(5,3), dpi = 100)
plt.hist(x = i, data = data)
plt.xlabel(col)
plt.show()
21

In [10]:

data['Genre'] = data['Genre'].map({'Male':1,'Female':2 }) # we can use get_dummies instead


data.head()

Out[10]:

Genre Age Annual Income (k$) Spending Score (1-100)

0 1 19 15 39

1 1 21 15 81

2 2 20 16 6

3 2 23 16 77

4 2 31 17 40

In [11]:

data.dtypes

Out[11]:

Genre int64
Age int64
Annual Income (k$) int64
Spending Score (1-100) int64
dtype: object

In [26]:

#scaling transformation
#1.zscore normalization using standardScalar(Same mean)
#2.Minmax normalization using MinMax Scaler(0 to 1)
22

In [12]:

df_customer = data.iloc[:,2:4]
df_customer.head()

Out[12]:

Annual Income (k$) Spending Score (1-100)

0 15 39

1 15 81

2 16 6

3 16 77

4 17 40

In [13]:

from sklearn.preprocessing import StandardScaler


data_scaled = StandardScaler().fit_transform(df_customer)
data_scaled

Out[13]:

array([[-1.73899919, -0.43480148],
[-1.73899919, 1.19570407],
[-1.70082976, -1.71591298],
[-1.70082976, 1.04041783],
[-1.66266033, -0.39597992],
[-1.66266033, 1.00159627],
[-1.62449091, -1.71591298],
[-1.62449091, 1.70038436],
[-1.58632148, -1.83237767],
[-1.58632148, 0.84631002],
[-1.58632148, -1.4053405 ],
[-1.58632148, 1.89449216],
[-1.54815205, -1.36651894],
[-1.54815205, 1.04041783],
[-1.54815205, -1.44416206],
[-1.54815205, 1.11806095],
[-1.50998262, -0.59008772],
[-1.50998262, 0.61338066],

In [14]:

# Finding the optimal number of K


wcss= []
for i in range(2,11):
kmodel = KMeans(n_clusters = i, init = 'random')
kmodel.fit(data_scaled)
wcss.append(kmodel.inertia_)
23

In [15]:

wcss

Out[15]:

[269.01679374906655,
157.70400815035939,
108.92131661364358,
65.56840815571681,
55.103778121150555,
44.86475569922555,
37.24321153347672,
33.85792110528426,
30.684270071530346]

In [16]:

#Plotting
plt.figure(figsize = (8,6), dpi=100)
plt.plot(range(2,11),wcss, marker = 'o', c='blue', markerfacecolor='red')
plt.xlabel('No of Clusters')
plt.ylabel('WCSS')
plt.show()

In [17]:

# Creating the final Kmeans model with no of clusters = 5


Kmodel_final = KMeans(n_clusters = 5, init = 'k-means++').fit(data_scaled)
24

In [18]:

cl = Kmodel_final.predict(data_scaled)

In [19]:

cl

Out[19]:

array([0, 3, 0, 3, 0, 3, 0, 3, 0, 3, 0, 3, 0, 3, 0, 3, 0, 3, 0, 3, 0, 3,
0, 3, 0, 3, 0, 3, 0, 3, 0, 3, 0, 3, 0, 3, 0, 3, 0, 3, 0, 3, 0, 1,
0, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 4, 2, 1, 2, 4, 2, 4, 2,
1, 2, 4, 2, 4, 2, 4, 2, 4, 2, 1, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2,
4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2,
4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2,
4, 2])

In [20]:

# Adding the clusters to a new column in the dataset


df_customer['cl']=cl
df_customer.head()

Out[20]:

Annual Income (k$) Spending Score (1-100) cl

0 15 39 0

1 15 81 3

2 16 6 0

3 16 77 3

4 17 40 0
25

In [27]:

# Visualization of clusters
plt.figure(figsize = (6,4), dpi = 100)
plt.scatter(x=df_customer['Annual Income (k$)'],y=df_customer['Spending Score (1-100)'],c=c
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score')
plt.show()

cl 1=high income low spender c2 = high income high spender c3 = low income high spender c4 = low income
low spender c5 = moderate income moderate spender

Conclusion
Mall customer data is clustered into 5 clusters. The green cluster indicates the people who have high spending
score but a low annual income. The purple cluster shows the people who have a low annual income & low
M 26

spending score. The blue cluster shows people who have an average annual income & average spending
score. The sea green cluster indicate the people who have high annual income & high spending score. The
yellow cluster shows the people who have low spending score & high annual income.

In [ ]:
1

ARCHANA NAIR
#M.Sc. Computer Science (With Specialization in Data Science) #University of Mumbai #PRACTICAL 4 :-
Heirarchical Clustering

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.cluster.hierarchy import fcluster
from sklearn.cluster import AgglomerativeClustering
from scipy.spatial.distance import cdist
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

In [3]:

data = pd.read_csv('C:/Users/archa/Downloads/USArrests.csv')
data.head()

Out[3]:

Unnamed: 0 Murder Assault UrbanPop Rape

0 Alabama 13.2 236 58 21.2

1 Alaska 10.0 263 48 44.5

2 Arizona 8.1 294 80 31.0

3 Arkansas 8.8 190 50 19.5

4 California 9.0 276 91 40.6

In [4]:

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 50 non-null object
1 Murder 50 non-null float64
2 Assault 50 non-null int64
3 UrbanPop 50 non-null int64
4 Rape 50 non-null float64
dtypes: float64(2), int64(2), object(1)
memory usage: 2.1+ KB
2

In [6]:

data.isna().sum()

Out[6]:

Unnamed: 0 0
Murder 0
Assault 0
UrbanPop 0
Rape 0
dtype: int64

In [7]:

data.shape

Out[7]:

(50, 5)
3

In [8]:

data['Unnamed: 0'].value_counts()

Out[8]:

Georgia 1
Nevada 1
Maryland 1
Hawaii 1
North Dakota 1
Nebraska 1
Rhode Island 1
Missouri 1
Oregon 1
Virginia 1
North Carolina 1
New Hampshire 1
Indiana 1
Idaho 1
New Mexico 1
Florida 1
Oklahoma 1
Arizona 1
Delaware 1
New Jersey 1
Montana 1
Colorado 1
Illinois 1
Vermont 1
Tennessee 1
Arkansas 1
Kansas 1
Ohio 1
Massachusetts 1
South Dakota 1
Louisiana 1
Kentucky 1
Utah 1
Minnesota 1
Alabama 1
West Virginia 1
Washington 1
Pennsylvania 1
Wisconsin 1
Connecticut 1
Texas 1
Wyoming 1
Mississippi 1
California 1
Michigan 1
Iowa 1
South Carolina 1
New York 1
Maine 1
Alaska 1
Name: Unnamed: 0, dtype: int64
4

In [16]:

data.describe()

Out[16]:

Murder Assault UrbanPop Rape

count 50.00000 50.000000 50.000000 50.000000

mean 7.78800 170.760000 65.540000 21.232000

std 4.35551 83.337661 14.474763 9.366385

min 0.80000 45.000000 32.000000 7.300000

25% 4.07500 109.000000 54.500000 15.075000

50% 7.25000 159.000000 66.000000 20.100000

75% 11.25000 249.000000 77.750000 26.175000

max 17.40000 337.000000 91.000000 46.000000

In [17]:

#Renaming the column that contains the states


data.rename(columns = {'Unnamed: 0':'States'}, inplace = True)
data.head()

Out[17]:

States Murder Assault UrbanPop Rape

0 Alabama 13.2 236 58 21.2

1 Alaska 10.0 263 48 44.5

2 Arizona 8.1 294 80 31.0

3 Arkansas 8.8 190 50 19.5

4 California 9.0 276 91 40.6

In [18]:

# Checking the murder rate


plt.figure(figsize = (20,5))
data.groupby('States')['Murder'].max().plot(kind ='bar')
plt.show()
5

In [34]:

# Checking the assault rate


plt.figure(figsize = (20,5))
data.groupby('States')['Assault'].max().plot(kind ='bar')
plt.show()

In [19]:

# Converting the States column to numeric using get_dummies from pandas library
data_num = pd.get_dummies(data, columns = ['States'])
data_num.head()

Out[19]:

Murder Assault UrbanPop Rape States_Alabama States_Alaska States_Arizona States_A

0 13.2 236 58 21.2 1 0 0

1 10.0 263 48 44.5 0 1 0

2 8.1 294 80 31.0 0 0 1

3 8.8 190 50 19.5 0 0 0

4 9.0 276 91 40.6 0 0 0

5 rows × 54 columns
20

# Using average linkage as the distance metrics


link = linkage(data_num, 'average')
link

Out[20]:

array([[ 14. , 28. , 2.6925824 , 2. ],


[ 16. , 25. , 4.08656335, 2. ],
[ 13. , 15. , 4.1761226 , 2. ],
[ 12. , 31. , 6.39531078, 2. ],
[ 34. , 43. , 6.7867518 , 2. ],
[ 35. , 45. , 7.48999332, 2. ],
[ 6. , 37. , 8.15107355, 2. ],
[ 18. , 40. , 8.65390085, 2. ],
[ 48. , 50. , 10.2823621 , 3. ],
[ 49. , 55. , 10.83611024, 3. ],
[ 47. , 57. , 10.86685976, 3. ],
[ 20. , 29. , 11.54339638, 2. ],
[ 26. , 51. , 12.51925264, 3. ],
[ 3. , 41. , 12.69330532, 2. ],
[ 36. , 59. , 12.95749268, 4. ],
[ 33. , 44. , 13.12135664, 2. ],
[ 21. , 27. , 13.37235955, 2. ],
[ 52. , 56. , 13.42862855, 4. ],
[ 2. , 30. , 13.96782016, 2. ],
[ 5. , 42. , 14.56983185, 2. ],
[ 11. , 62. , 15.09585446, 4. ],
[ 54. , 67. , 15.19058848, 6. ],
[ 19. , 68. , 15.51774813, 3. ],
[ 0. , 17. , 15.51902059, 2. ],
[ 46. , 64. , 16.49150054, 5. ],
[ 7. , 73. , 16.95059831, 3. ],
[ 53. , 66. , 18.4731512 , 4. ],
[ 22. , 58. , 19.04598906, 4. ],
[ 24. , 63. , 20.2507332 , 3. ],
[ 70. , 71. , 20.65162046, 10. ],
[ 23. , 39. , 21.21438191, 2. ],
[ 38. , 61. , 22.64143504, 3. ],
[ 9. , 69. , 24.01391232, 3. ],
[ 75. , 76. , 26.40399464, 7. ],
[ 74. , 81. , 26.7531159 , 8. ],
[ 65. , 77. , 27.81929489, 6. ],
[ 4. , 72. , 28.04933353, 4. ],
[ 1. , 80. , 28.13138593, 3. ],
[ 78. , 82. , 29.08942271, 6. ],
[ 60. , 85. , 33.14911215, 9. ],
[ 8. , 32. , 38.55385843, 2. ],
[ 83. , 87. , 39.42094783, 10. ],
[ 10. , 89. , 41.12189846, 10. ],
[ 86. , 91. , 44.30793939, 14. ],
[ 84. , 88. , 44.86221876, 14. ],
[ 79. , 92. , 54.76706367, 20. ],
[ 90. , 93. , 77.61895275, 16. ],
[ 94. , 95. , 89.24555288, 34. ],
[ 96. , 97. , 152.3219326 , 50. ]])
7

# Implementing the agglomerative clustering


dendrogram(link)
plt.title('Truncated Hierarchical clustering')
plt.xlabel('CLuster Size')
plt.ylabel('distance')
plt.show()

In [23]:

k = 3
h_cluster = AgglomerativeClustering(n_clusters = k, affinity = 'euclidean', linkage = 'ward
h_cluster.fit(data_num)

Out[23]:

AgglomerativeClustering(n_clusters=3)

In [24]:

cluster = h_cluster.fit_predict(data_num)

In [26]:

# Silhouette Score
print('Silhouette Score: %0.3f' % metrics.silhouette_score(data_num,h_cluster.labels_))

Silhouette Score: 0.532

K MEANS CLUSTERING
M 8

In [27]:

# Finding the optimal number of clusters


wcss = []
for i in range(2,9):
kmodel = KMeans(n_clusters = i, init = 'random')
kmodel.fit(data_num)
wcss.append(kmodel.inertia_)

In [28]:

wcss

Out[28]:

[96447.02814449916,
48011.26535714287,
34774.629357142854,
29124.6065,
19001.82888888889,
16633.143809523808,
13960.15160714286]

In [29]:

# Elbow plot
plt.figure(figsize = (6,4), dpi = 100)
plt.plot(range(2,9), wcss, marker = 'o', c = 'blue')
plt.xlabel('No of clusters')
plt.ylabel('WCSS')
plt.show()
9

In [30]:
# Final model with 5 clusters
kmod_final = KMeans(n_clusters = 5, init = 'k-means++').fit(data_num)
cl = kmod_final.predict(data_num)
cl

Out[30]:

array([0, 0, 4, 3, 0, 3, 1, 0, 4, 3, 2, 1, 0, 1, 2, 1, 1, 0, 2, 4, 3, 0,
2, 0, 3, 1, 1, 0, 2, 3, 0, 0, 4, 2, 1, 3, 3, 1, 3, 0, 2, 3, 3, 1,
2, 3, 3, 2, 2, 3])

In [31]:

# Silhouette Score
from sklearn.metrics import silhouette_score
score = silhouette_score(data_num, kmod_final.labels_, metric='euclidean')

In [32]:

score

Out[32]:

0.4486735234754001

In [ ]:
1

ARCHANA NAIR
#M.Sc. Computer Science (With Specialization in Data Science) #University of Mumbai #PRACTICAL 5 :-
Implement KNN & PCA

In [1]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import GridSearchCV

In [3]:

titan_train = pd.read_csv("C:/Users/archa/Downloads/titan_train.csv")
titan_train.head()

Out[3]:

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare

Braund,
0 1 0 3 Mr. Owen male 22.0 1 0 A/5 21171 7.2500
Harris

Cumings,
Mrs. John
Bradley
1 2 1 1 female 38.0 1 0 PC 17599 71.2833
(Florence
Briggs
Th...

Heikkinen, STON/O2.
2 3 1 3 Miss. female 26.0 0 0 7.9250
3101282
Laina

Futrelle,
Mrs.
Jacques
3 4 1 1 female 35.0 1 0 113803 53.1000
Heath
(Lily May
Peel)

Allen, Mr.
4 5 0 3 William male 35.0 0 0 373450 8.0500
Henry
2

In [5]:

titan_test = pd.read_csv("C:/Users/archa/Downloads/titan_test.csv")
titan_test.head()

Out[5]:

PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Emba
Kelly, Mr.
0 892 3
male 34.5 0 0 330911 7.8292 NaN
James
Wilkes,
Mrs.
1 893 3 James female 47.0 1 0 363272 7.0000 NaN
(Ellen
Needs)

Myles, Mr.
2 894 2 Thomas male 62.0 0 0 240276 9.6875 NaN
Francis
Wirz, Mr.
3 895 3
male 27.0 0 0 315154 8.6625 NaN
Albert
Hirvonen,
Mrs.
4 896 3 Alexander female 22.0 1 1 3101298 12.2875 NaN
(Helga E
Lindqvist)

In [6]:

titan_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
3

In [7]:

y_train = titan_train['Survived']

# Studying the data values


y_train.value_counts()

Out[7]:

0 549
1 342
Name: Survived, dtype: int64

In [8]:

# Checking the cabin feature


titan_train['Cabin'].isna().sum()/titan_train.shape[0]

Out[8]:

0.7710437710437711

In [9]:

# Dropping the columns that are not required


titan_train.drop(['Name','PassengerId','Ticket','Cabin'], axis = 1, inplace = True)
titan_test.drop(['Name','PassengerId','Ticket','Cabin'], axis = 1, inplace = True)
titan_train.head()

Out[9]:

Survived Pclass Sex Age SibSp Parch Fare Embarked

0 0 3 male 22.0 1 0 7.2500 S

1 1 1 female 38.0 1 0 71.2833 C

2 1 3 female 26.0 0 0 7.9250 S

3 1 1 female 35.0 1 0 53.1000 S

4 0 3 male 35.0 0 0 8.0500 S


4

In [10]:

# Dealing with missing values in the training set and test set
titan_train['Embarked'] = titan_train['Embarked'].fillna('S')
titan_train['Age'] = titan_train['Age'].fillna(titan_train['Age'].mean())
titan_test['Embarked'] = titan_test['Embarked'].fillna('S')
titan_test['Age'] = titan_test['Age'].fillna(titan_test['Age'].mean())
titan_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Survived 891 non-null int64
1 Pclass 891 non-null int64
2 Sex 891 non-null object
3 Age 891 non-null float64
4 SibSp 891 non-null int64
5 Parch 891 non-null int64
6 Fare 891 non-null float64
7 Embarked 891 non-null object
dtypes: float64(2), int64(4), object(2)
memory usage: 55.8+ KB

In [12]:

Converting the categorical variables


itan_train = pd.get_dummies(titan_train, columns = ['Sex', 'Embarked'], drop_first = False)
itan_test = pd.get_dummies(titan_test, columns = ['Sex', 'Embarked'], drop_first = False)
itan_train.head()

Out[12]:

Survived Pclass Age SibSp Parch Fare Sex_female Sex_male Embarked_C Embar

0 0 3 22.0 1 0 7.2500 0 1 0

1 1 1 38.0 1 0 71.2833 1 0 1

2 1 3 26.0 0 0 7.9250 1 0 0

3 1 1 35.0 1 0 53.1000 1 0 0

4 0 3 35.0 0 0 8.0500 0 1 0
5

In [13]:

titan_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Pclass 418 non-null int64
1 Age 418 non-null float64
2 SibSp 418 non-null int64
3 Parch 418 non-null int64
4 Fare 417 non-null float64
5 Sex_female 418 non-null uint8
6 Sex_male 418 non-null uint8
7 Embarked_C 418 non-null uint8
8 Embarked_Q 418 non-null uint8
9 Embarked_S 418 non-null uint8
dtypes: float64(2), int64(3), uint8(5)
memory usage: 18.5 KB

In [14]:

# treating the missing value in the test set


titan_test['Fare'] = titan_test['Fare'].fillna(titan_test['Fare'].mean())
titan_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Pclass 418 non-null int64
1 Age 418 non-null float64
2 SibSp 418 non-null int64
3 Parch 418 non-null int64
4 Fare 418 non-null float64
5 Sex_female 418 non-null uint8
6 Sex_male 418 non-null uint8
7 Embarked_C 418 non-null uint8
8 Embarked_Q 418 non-null uint8
9 Embarked_S 418 non-null uint8
dtypes: float64(2), int64(3), uint8(5)
memory usage: 18.5 KB

In [15]:

y_train.shape

Out[15]:

(891,)
6

In [17]:

# Scaling the data


scaled_titan_train = StandardScaler().fit_transform(titan_train)
scaled_titan_test = StandardScaler().fit_transform(titan_test)

In [19]:

# Applying PCA
pca_model = PCA(n_components = 0.95)
pca_scaled_titan = pca_model.fit_transform(scaled_titan_train)
pca_model.explained_variance_ratio_

Out[19]:

array([0.26196203, 0.17931188, 0.1521482 , 0.12721991, 0.0876279 ,


0.06664062, 0.05123602, 0.04365245])

In [20]:

# Plotting to see how many components shoud be used


plt.figure(figsize = (8,7))
plt.plot(np.cumsum(pca_model.explained_variance_ratio_))
plt.xlabel('Number of components')
plt.ylabel('Expalined Variance')
plt.show()
7

In [21]:

pca_model.n_components_

Out[21]:

In [23]:

data into training and validation data


x_test, val_train, val_test = train_test_split(pca_scaled_titan, y_train, test_size = 0.3)
shape, x_test.shape, val_train.shape, val_test.shape

Out[23]:

((623, 8), (268, 8), (623,), (268,))

In [24]:

# Find the optimal number of k


params = {'n_neighbors': range(1, 10)}
gscv_model = GridSearchCV(KNeighborsClassifier(), params)

In [27]:

# Fitting the model


gscv_model.fit(titan_train, val_train)

Out[27]:

GridSearchCV(estimator=KNeighborsClassifier(),
param_grid={'n_neighbors': range(1, 10)})

In [28]:

# best value of k
gscv_model.best_params_

Out[28]:

{'n_neighbors': 1}

In [29]:

# Applying KNN
knn_model = KNeighborsClassifier(n_neighbors = 1).fit(titan_train, val_train)
pred = knn_model.predict(x_test)
8

In [30]:

# confusion matrix
cm = confusion_matrix(val_test, pred)
print(cm)

[[165 1]
[ 0 102]]

In [31]:

# Classification report
print(classification_report(val_test, pred))

precision recall f1-score support

0 1.00 0.99 1.00 166


1 0.99 1.00 1.00 102

accuracy 1.00 268


macro avg 1.00 1.00 1.00 268
weighted avg 1.00 1.00 1.00 268

In [32]:

# testing on test data


titan_test.shape

Out[32]:

(418, 10)

In [33]:

# Applying pca on the test data


pca_mod = PCA(n_components= 8)

In [34]:

pca_test = pca_mod.fit_transform(scaled_titan_test)

In [35]:

pca_mod.explained_variance_ratio_

Out[35]:

array([0.25455619, 0.1968992 , 0.17360053, 0.12081349, 0.09336117,


0.06958407, 0.05654625, 0.03463909])
9

In [36]:

plt.figure(figsize = (8,7))
plt.plot(np.cumsum(pca_mod.explained_variance_ratio_))
plt.xlabel('Number of Components')
plt.ylabel('Explained Variance')
plt.show()

In [37]:

pca_mod.n_components_

Out[37]:

8
10

In [39]:

pred2 = knn_model.predict(pca_test)
pred2

Out[39]:

array([0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0,
1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,
1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1,
1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1,
0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1,
1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0,
0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0,
1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1,
1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,
1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0,
1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0,
1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0,
0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0,
1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,
0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0],
dtype=int64)

In [ ]:

b 10/10
1

ARCHANA NAIR
#M.Sc. Computer Science (With Specialization in Data Science) #University of Mumbai #PRACTICAL 6 :-
Implement KNN-REGRESSOR

In [2]:

# Importing the libraries


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error as mse
from sklearn.neighbors import KNeighborsRegressor
from math import sqrt
from sklearn.model_selection import GridSearchCV

In [2]:

# Reading the data


df_data = pd.read_csv('C:/Users/archa/Downloads/Admission_Predict.csv')
df_data.head()

Out[2]:

Serial GRE TOEFL University Chance of


SOP LOR CGPA Research
No. Score Score Rating Admit

0 1 337 118 4 4.5 4.5 9.65 1 0.92

1 2 324 107 4 4.0 4.5 8.87 1 0.76

2 3 316 104 3 3.0 3.5 8.00 1 0.72

3 4 322 110 3 3.5 2.5 8.67 1 0.80

4 5 314 103 2 2.0 3.0 8.21 0 0.65

In [3]:

df_data.shape

Out[3]:

(400, 9)
2

In [8]:

df_data.columns

Out[8]:

Index(['Serial No.', 'GRE Score', 'TOEFL Score', 'University Rating', 'SOP',


'LOR ', 'CGPA', 'Research', 'Chance of Admit '],
dtype='object')

In [6]:

df_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Serial No. 400 non-null int64
1 GRE Score 400 non-null int64
2 TOEFL Score 400 non-null int64
3 University Rating 400 non-null int64
4 SOP 400 non-null float64
5 LOR 400 non-null float64
6 CGPA 400 non-null float64
7 Research 400 non-null int64
8 Chance of Admit 400 non-null float64
dtypes: float64(4), int64(5)
memory usage: 28.2 KB

In [10]:

# removing the spaces between the column names


df_data.columns = ['Serial_No.', 'GRE_Score', 'TOEFL_Score', 'University_Rating', 'SOP',
'LOR ', 'CGPA', 'Research', 'Chance_of_Admit']
3

In [12]:

# Checking the correlation


plt.figure(figsize = (8,6))
sns.heatmap(df_data.corr(), annot = True)
plt.show()
4

In [16]:

# Dropping teh variables that are not required


df_data.drop(['Serial_No.'], axis = 1, inplace = True)
df_data.head()

Out[16]:

GRE_Score TOEFL_Score University_Rating SOP LOR CGPA Research Chance_of_Admi

0 337 118 4 4.5 4.5 9.65 1 0.9

1 324 107 4 4.0 4.5 8.87 1 0.7

2 316 104 3 3.0 3.5 8.00 1 0.7

3 322 110 3 3.5 2.5 8.67 1 0.8

4 314 103 2 2.0 3.0 8.21 0 0.6

In [18]:

# Splitting the independent and dependent variables


x = df_data.drop('Chance_of_Admit', axis = 1)
y = df_data['Chance_of_Admit']
x.head()

Out[18]:

GRE_Score TOEFL_Score University_Rating SOP LOR CGPA Research

0 337 118 4 4.5 4.5 9.65 1

1 324 107 4 4.0 4.5 8.87 1

2 316 104 3 3.0 3.5 8.00 1

3 322 110 3 3.5 2.5 8.67 1

4 314 103 2 2.0 3.0 8.21 0

In [19]:

# Scaling the data


scaler = StandardScaler()
scaled_x = scaler.fit_transform(x)

In [20]:

# Splitting the data into train and test set


x_tr, x_test, y_tr, y_test = train_test_split(scaled_x, y)
x_tr.shape, x_test.shape, y_tr.shape, y_test.shape

Out[20]:

((300, 7), (100, 7), (300,), (100,))


5

In [21]:

# Applying Linear regression


lr_model = LinearRegression().fit(x_tr, y_tr)

In [22]:

pred_1 = lr_model.predict(x_test)
pred_1

Out[22]:

array([0.67909121, 0.95551592, 0.72894865, 0.48583543, 0.53852174,


0.67092294, 0.8204142 , 0.7783607 , 0.64455292, 0.46479729,
0.70176684, 0.65919873, 0.60116658, 0.91029432, 0.70461871,
0.61148716, 0.74706861, 0.7249588 , 0.94875688, 0.82792022,
0.77842265, 0.86015917, 0.52533426, 0.95098272, 0.8373398 ,
0.54278208, 0.58086278, 0.69831249, 0.41788173, 0.58148695,
0.63403675, 0.51614761, 0.72443846, 1.0057703 , 0.80674416,
0.5145337 , 1.00772297, 0.54703292, 0.68594236, 0.94203988,
0.77378915, 0.66626015, 0.53689877, 0.66415535, 0.79274928,
0.85642724, 0.6455203 , 0.63928183, 0.72364687, 0.73350815,
0.90680501, 0.89234368, 0.64048648, 0.93967212, 0.83906353,
0.59857466, 0.84595632, 0.75492832, 0.61715465, 0.75846671,
0.56496637, 0.93930258, 0.67270008, 0.6913754 , 0.64666463,
0.73590017, 0.7138422 , 0.7221923 , 0.46382825, 0.58880676,
0.58715168, 0.63054123, 0.82516573, 0.7874021 , 0.73647047,
0.67348602, 0.59966386, 0.77876368, 0.62996058, 0.74879008,
0.75509647, 0.51474536, 0.79239247, 0.69871467, 0.70744775,
0.64914752, 0.67588826, 0.61908792, 0.7867352 , 0.78725215,
0.61198433, 0.96166414, 0.64506994, 0.71264468, 0.94441163,
0.42890377, 0.68670028, 0.64288935, 0.80816794, 0.6564213 ])

In [23]:

# Checking the r2_score


r2_score(y_test, pred_1)

Out[23]:

0.7693406066446685

In [24]:

# Root mean squared error


sqrt(mse(y_test, pred_1))

Out[24]:

0.0674866003456461

KNN REGRESSION
6

In [25]:

# Find the optimal number of k


params = {'n_neighbors': range(1,20)}
gscv_model = GridSearchCV(KNeighborsRegressor(), params, cv = 5)

In [26]:

# Fitting the model


gscv_model.fit(x_tr, y_tr)

Out[26]:

GridSearchCV(cv=5, estimator=KNeighborsRegressor(),
param_grid={'n_neighbors': range(1, 20)})

In [27]:

# Best value for k


gscv_model.best_params_

Out[27]:

{'n_neighbors': 16}

In [28]:

# Applying knn regression


knn_reg = KNeighborsRegressor(n_neighbors = 7).fit(x_tr, y_tr)
pred_knn = knn_reg.predict(x_test)

In [29]:

# r2 score
r2_score(y_test, pred_knn)

Out[29]:

0.7484011160531958

In [30]:

# Checking rmse
mse(y_test, pred_knn, squared = False)

Out[30]:

0.07048331688551321

In [ ]:
7

NAME : ARCHANA NAIR


SUBJECT : Algorithm for Data Science
COURSE : M.Sc. Computer Science with Specialization in
Data Science
PRACTICAL 7 (TIME SERIES)
install.packages("timeSeries")
library(timeSeries)
install.packages("forecast")
library(forecast)
data=read.csv('C:/Users/archa/Downloads/daily-total-female-
births.csv')
head(data)

data$Date = as.Date(data$Date)
data['year'] = strftime(data$Date,'%Y')
8

data['mon'] = strftime(data$Date,'%b')
data['day'] = strftime(data$Date,'%d')
data

df_new = subset(data, data$day!=31, c('Births'))


df_new
9

# Plotting the data.


plot(df_new$Births)

# Converting the data to time series format.


data = ts(data = df_new, frequency = 30)
head(df_new)
plot(data)
10

# Print the start, end and frequency


# start
start(data)
# end
end(data)
# frequency
frequency(data)
11

#Decompose the dataset


decomp_data = decompose(data)
plot(decomp_data)

# Explore it’s components.


12

plot(decomp_data$trend)

plot(decomp_data$random)

plot(decomp_data$seasonal)
13

boxplot(data~cycle(data))

#Remove the seasonality and trend.


stationary_2= data-decomp_data$seasonal-decomp_data$trend
plot(stationary_2)
14

#Apply ARIMA
arima_mod = auto.arima(stationary_2)
f_am = forecast(arima_mod,h = 5, level = c(95))
f_am
plot(f_am)
15

# ARIMA on the original time series data of daily births


arima_mod2 = auto.arima(data)
f_org = forecast(arima_mod2, h= 5, level = c(95))
f_org
plot(f_org)
1

ARCHANA NAIR #M.Sc. Computer Science (With Specialization in Data Science) #University of Mumbai
#PRACTICAL 8 - Predicting the authors of the disputed federalist papers

In [41]:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from sklearn import tree
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

In [42]:

data = pd.read_csv('C:/Users/archa/Downloads/Disputed_Essay_data.csv')
data.head()

Out[42]:

author filename a all also an and any are as ... was w

0 dispt dispt_fed_49.txt 0.280 0.052 0.009 0.096 0.358 0.026 0.131 0.122 ... 0.009 0.0

1 dispt dispt_fed_50.txt 0.177 0.063 0.013 0.038 0.393 0.063 0.051 0.139 ... 0.051 0.0

2 dispt dispt_fed_51.txt 0.339 0.090 0.008 0.030 0.301 0.008 0.068 0.203 ... 0.008 0.0

3 dispt dispt_fed_52.txt 0.270 0.024 0.016 0.024 0.262 0.056 0.064 0.111 ... 0.087 0.0

4 dispt dispt_fed_53.txt 0.303 0.054 0.027 0.034 0.404 0.040 0.128 0.148 ... 0.027 0.0

5 rows × 72 columns

In [43]:

# Splitting the data


data_test = data[data['author']=='dispt']
data_tr = data[data['author']!='dispt']
2

In [44]:

data_tr.head()

Out[44]:

author filename a all also an and any are as ... w

11 Hamilton Hamilton_fed_1.txt 0.213 0.083 0.000 0.083 0.343 0.056 0.111 0.093 ... 0.0

12 Hamilton Hamilton_fed_11.txt 0.369 0.070 0.006 0.076 0.411 0.023 0.053 0.117 ... 0.0

13 Hamilton Hamilton_fed_12.txt 0.305 0.047 0.007 0.068 0.386 0.047 0.102 0.108 ... 0.0

14 Hamilton Hamilton_fed_13.txt 0.391 0.045 0.015 0.030 0.270 0.045 0.060 0.090 ... 0.0

15 Hamilton Hamilton_fed_15.txt 0.327 0.096 0.000 0.086 0.356 0.014 0.086 0.072 ... 0.0

5 rows × 72 columns

In [45]:

data_test.head()

Out[45]:

author filename a all also an and any are as ... was w

0 dispt dispt_fed_49.txt 0.280 0.052 0.009 0.096 0.358 0.026 0.131 0.122 ... 0.009 0.0

1 dispt dispt_fed_50.txt 0.177 0.063 0.013 0.038 0.393 0.063 0.051 0.139 ... 0.051 0.0

2 dispt dispt_fed_51.txt 0.339 0.090 0.008 0.030 0.301 0.008 0.068 0.203 ... 0.008 0.0

3 dispt dispt_fed_52.txt 0.270 0.024 0.016 0.024 0.262 0.056 0.064 0.111 ... 0.087 0.0

4 dispt dispt_fed_53.txt 0.303 0.054 0.027 0.034 0.404 0.040 0.128 0.148 ... 0.027 0.0

5 rows × 72 columns
3

In [46]:

# Checking for missing values


data_tr.isna().sum()

Out[46]:

author 0
filename 0
a 0
all 0
also 0
..
who 0
will 0
with 0
would 0
your 0
Length: 72, d type: int64

In [47]:

data_tr['author'].value_counts()

Out[47]:

Hamilton 51
Madison 15
Jay 5
HM 3
Name: author, dtype: int64

In [48]:

# Seperating the features and target


x = data_tr.drop(['author','filename'], axis = 1)
y = data_tr['author']
x.head()

Out[48]:

a all also an and any are as at be ... was were wha

11 0.213 0.083 0.000 0.083 0.343 0.056 0.111 0.093 0.065 0.315 ... 0.000 0.000 0.00

12 0.369 0.070 0.006 0.076 0.411 0.023 0.053 0.117 0.065 0.258 ... 0.000 0.012 0.01

13 0.305 0.047 0.007 0.068 0.386 0.047 0.102 0.108 0.088 0.271 ... 0.000 0.000 0.00

14 0.391 0.045 0.015 0.030 0.270 0.045 0.060 0.090 0.015 0.376 ... 0.000 0.000 0.00

15 0.327 0.096 0.000 0.086 0.356 0.014 0.086 0.072 0.115 0.211 ... 0.014 0.038 0.01

5 rows × 70 columns
4

In [49]:

# Splitting the data for training and testing


x_tr, x_test, y_tr, y_test = train_test_split(x ,y, test_size = 0.2)

In [50]:

# Applying the decision tree


dtree = DecisionTreeClassifier(criterion = 'gini').fit(x_tr, y_tr)

In [51]:

pred_dtree = dtree.predict(x_test)

In [52]:

# Classification Report
print(classification_report(y_test, pred_dtree))

precision recall f1-score support

HM 0.00 0.00 0.00 1


Hamilton 0.91 1.00 0.95 10
Madison 0.75 0.75 0.75 4

accuracy 0.87 15
macro avg 0.55 0.58 0.57 15
weighted avg 0.81 0.87 0.83 15

C:\Users\verma\anaconda3\lib\site-packages\sklearn\metrics\_classification.p
y:1245: UndefinedMetricWarning: Precision and F-score are ill-defined and be
ing set to 0.0 in labels with no predicted samples. Use `zero_division` para
meter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
C:\Users\verma\anaconda3\lib\site-packages\sklearn\metrics\_classification.p
y:1245: UndefinedMetricWarning: Precision and F-score are ill-defined and be
ing set to 0.0 in labels with no predicted samples. Use `zero_division` para
meter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
C:\Users\verma\anaconda3\lib\site-packages\sklearn\metrics\_classification.p
y:1245: UndefinedMetricWarning: Precision and F-score are ill-defined and be
ing set to 0.0 in labels with no predicted samples. Use `zero_division` para
meter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
5/15/22, 4:39 PM 5

In [53]:

# Confusion matrix
cm = confusion_matrix(y_test, pred_dtree)
cm

Out[53]:

array([[ 0, 0, 1],
[ 0, 10, 0],
[ 0, 1, 3]], dtype=int64)

In [54]:

# Accuracy Score
accuracy_score(y_test, pred_dtree)

Out[54]:

0.8666666666666667

In [55]:

# Printing the decision tree


tree.export_text(dtree)

Out[55]:

'|--- feature_59 <= 0.01\n| |--- feature_64 <= 0.11\n| | |--- feature_ 4
<= 0.57\n| | | |--- feature_33 <= 0.02\n| | | | |--- class: HM\n|
| | |--- feature_33 > 0.02\n| | | | |--- class: Hamilto
n\n| ||--- feature_4 > 0.57\n| | | |--- class: Jay\n| |--- fea
ture_64 > 0.11\n| | |--- feature_38 <= 0.74\n| | | |--- class: HM
\n| | |--- feature_38 > 0.74\n| | | |--- class: Madison\n|--- fea
ture_59 > 0.01\n| |--- class: Hamilton\n'
6

In [56]:

0,20))
eature_names = x.columns, class_names = data['author'].value_counts().index, filled = True)
7

In [57]:

# Predcition on test data


data_test = data_test.drop(['author', 'filename'], axis =1)

In [58]:

pred_test = dtree.predict(data_test)
pred_test

Out[58]:

array(['Madison', 'Madison', 'Hamilton', 'Madison', 'Madison', 'Madison',


'Madison', 'Madison', 'Madison', 'Madison', 'Madison'],
dtype=object)

In [59]:

data_test['Predicted Author'] = pred_test


data_test.head()

Out[59]:

a all also an and any are as at be ... were what when

0 0.280 0.052 0.009 0.096 0.358 0.026 0.131 0.122 0.017 0.411 ... 0.017 0.000 0.009

1 0.177 0.063 0.013 0.038 0.393 0.063 0.051 0.139 0.114 0.393 ... 0.000 0.000 0.000

2 0.339 0.090 0.008 0.030 0.301 0.008 0.068 0.203 0.023 0.474 ... 0.015 0.008 0.000

3 0.270 0.024 0.016 0.024 0.262 0.056 0.064 0.111 0.056 0.365 ... 0.079 0.008 0.024

4 0.303 0.054 0.027 0.034 0.404 0.040 0.128 0.148 0.013 0.344 ... 0.020 0.020 0.007

5 rows × 71 columns

In [60]:

# Applying KNN Classification

In [61]:

# Scaling the data


scaled_x_tr = StandardScaler().fit_transform(x)
scaled_x_test = StandardScaler().fit_transform(x_test)
8

In [62]:

# Splitting data into training and validation data


x_tr, x_test, yy_tr, yy_test = train_test_split(scaled_x_tr, y, test_size = 0.2)
x_tr.shape, x_test.shape, y_tr.shape, y_test.shape

Out[62]:

((59, 70), (15, 70), (59,), (15,))

In [63]:

# Find the optimal number of k


params = {'n_neighbors':range(1,50)}
gscv_mod = GridSearchCV(KNeighborsClassifier(), params, cv = 10)

In [64]:

gscv_mod.fit(x_tr, y_tr)

C:\Users\verma\anaconda3\lib\site-packages\sklearn\model_selection\_split.p
y:666: UserWarning: The least populated class in y has only 2 members, which
is less than n_splits=10.
warnings.warn(("The least populated class in y has only %d"

Out[64]:

GridSearchCV(cv=10, estimator=KNeighborsClassifier(),
param_grid={'n_neighbors': range(1, 50)})

In [65]:

# Best k
gscv_mod.best_params_

Out[65]:

{'n_neighbors': 10}

In [66]:

# Applying knn
knn_mod = KNeighborsClassifier(n_neighbors = 7).fit(x_tr, y_tr)

In [67]:

# Predicting
pred_knn = knn_mod.predict(x_test)
pred_knn

Out[67]:

array(['Hamilton', 'Hamilton', 'Hamilton', 'Hamilton', 'Hamilton',


'Hamilton', 'Hamilton', 'Hamilton', 'Hamilton', 'Hamilton',
'Hamilton', 'Hamilton', 'Hamilton', 'Hamilton', 'Hamilton'],
dtype=object)
9

In [68]:

data_test2 = data_test.drop('Predicted Author', axis = 1)


data_test2

Out[68]:

a all also an and any are as at be ... was were wha

0 0.280 0.052 0.009 0.096 0.358 0.026 0.131 0.122 0.017 0.411 ... 0.009 0.017 0.00

1 0.177 0.063 0.013 0.038 0.393 0.063 0.051 0.139 0.114 0.393 ... 0.051 0.000 0.00

2 0.339 0.090 0.008 0.030 0.301 0.008 0.068 0.203 0.023 0.474 ... 0.008 0.015 0.00

3 0.270 0.024 0.016 0.024 0.262 0.056 0.064 0.111 0.056 0.365 ... 0.087 0.079 0.00

4 0.303 0.054 0.027 0.034 0.404 0.040 0.128 0.148 0.013 0.344 ... 0.027 0.020 0.02

5 0.245 0.059 0.007 0.067 0.282 0.052 0.111 0.252 0.015 0.297 ... 0.007 0.030 0.01

6 0.349 0.036 0.007 0.029 0.335 0.058 0.087 0.073 0.116 0.378 ... 0.015 0.029 0.01

7 0.414 0.083 0.009 0.018 0.478 0.046 0.110 0.074 0.037 0.331 ... 0.018 0.009 0.00

8 0.248 0.040 0.007 0.040 0.356 0.034 0.154 0.161 0.047 0.289 ... 0.027 0.007 0.02

9 0.442 0.062 0.006 0.075 0.423 0.037 0.093 0.100 0.031 0.379 ... 0.000 0.000 0.02

10 0.276 0.048 0.015 0.082 0.324 0.044 0.058 0.135 0.048 0.290 ... 0.044 0.024 0.00

11 rows × 70 columns

In [69]:

# predicting using the test data


pred_test = knn_mod.predict(data_test2)

In [70]:

data_test2['Pred_Author'] = pred_test
data_test2.head()

Out[70]:

a all also an and any are as at be ... were what when

0 0.280 0.052 0.009 0.096 0.358 0.026 0.131 0.122 0.017 0.411 ... 0.017 0.000 0.009

1 0.177 0.063 0.013 0.038 0.393 0.063 0.051 0.139 0.114 0.393 ... 0.000 0.000 0.000

2 0.339 0.090 0.008 0.030 0.301 0.008 0.068 0.203 0.023 0.474 ... 0.015 0.008 0.000

3 0.270 0.024 0.016 0.024 0.262 0.056 0.064 0.111 0.056 0.365 ... 0.079 0.008 0.024

4 0.303 0.054 0.027 0.034 0.404 0.040 0.128 0.148 0.013 0.344 ... 0.020 0.020 0.007

5 rows × 71 columns
k 10

In [71]:

# Confusion matrix
print(confusion_matrix(y_test, pred_knn))

[[ 0 1 0]
[ 0 10 0]
[ 0 4 0]]

In [72]:

accuracy_score(y_test, pred_knn)

Out[72]:

0.6666666666666666

The model created using Decision Tree Classifier gives an accuracy score of 87% whereas the model created
using K Neighbors gives an accuracy score of 67%.

Therefore, from the accuracy we can see that the decision tree performs better in this case.

In [ ]:

10/10
1

ARCHANA NAIR
#M.Sc. Computer Science (With Specialization in Data Science) #University of Mumbai #PRACTICAL 9:
Sentiment analysis

In [8]:

import nltk
from wordcloud import WordCloud
from nltk.tokenize import word_tokenize
from nltk.tokenize import RegexpTokenizer
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

import pandas as pd
import numpy as np

In [10]:

data = pd.read_csv('C:/Users/archa/Downloads/K8 Reviews v0.2.csv')


data.head()

Out[10]:

sentiment review

0 1 Good but need updates and improvements

1 0 Worst mobile i have bought ever, Battery is dr...

2 1 when I will get my 10% cash back.... its alrea...

3 1 Good

4 0 The worst phone everThey have changed the last...

In [11]:

# Data preprocessing
import string
string.punctuation

Out[11]:

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
2

In [13]:

# Tokenization using word_token or RegexpTokenizer


nltk.download('punkt')
words = []
for i in data['review']:
words.append(word_tokenize(i))

[nltk_data] Downloading package punkt to


[nltk_data] C:\Users\archa\AppData\Roaming\nltk_data...
[nltk_data] Unzipping tokenizers\punkt.zip.

In [14]:

words

Out[14]:

[['Good', 'but', 'need', 'updates', 'and', 'improvements'],


['Worst',
'mobile',
'i',
'have',
'bought',
'ever',
',',
'Battery',
'is',
'draining',
'like',
'hell',
',',
'backup',
'is',
'only',
'6',

In [15]:

# Removing the digits and punctuations


word_token = []
for i in data['review']:
word_token.append(RegexpTokenizer('[a-z|A-Z]+').tokenize(i))
3

In [16]:

word_token

Out[16]:

[['Good', 'but', 'need', 'updates', 'and', 'improvements'],


['Worst',
'mobile',
'i',
'have',
'bought',
'ever',
'Battery',
'is',
'draining',
'like',
'hell',
'backup',
'is',
'only',
'to',
'hours',
'with',

In [17]:

len(word_token)

Out[17]:

14675

In [20]:

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to


[nltk_data] C:\Users\archa\AppData\Roaming\nltk_data...
[nltk_data] Unzipping corpora\stopwords.zip.

Out[20]:

True
4

In [21]:

# removing stop words


stopwords.words("English")

Out[21]:

['i',
'me',
'my',
'myself',
'we',
'our',
'ours',
'ourselves',
'you',
"you're",
"you've",
"you'll",
"you'd",
'your',
'yours',
'yourself',
'yourselves',
'he',

In [25]:

word_clean = []
for i in word_token:
word_clean_row = []
for j in i:
if j not in stopwords.words("English"):
word_clean_row.append(j)
word_clean.append(word_clean_row)

In [26]:

word_clean

Out[26]:

[['Good', 'need', 'updates', 'improvements'],


['Worst',
'mobile',
'bought',
'ever',
'Battery',
'draining',
'like',
'hell',
'backup',
'hours',
'internet',
'uses',
'even',
'I',
'put',
'mobile',
'idle',
5

In [29]:

# Converting the text to lower case


word_lower_row = []
word_lower = []
for i in word_clean:
word_lower_row = []
for j in i:
word_lower_row.append(j.lower())
word_lower.append(word_lower_row)

In [30]:

word_lower

Out[30]:

[['good', 'need', 'updates', 'improvements'],


['worst',
'mobile',
'bought',
'ever',
'battery',
'draining',
'like',
'hell',
'backup',
'hours',
'internet',
'uses',
'even',
'i',
'put',
'mobile',
'idle',

In [31]:

# stemming
word_stem_row = []
word_stem = []
for i in word_lower:
word_stem_row = []
for j in i:
word_stem_row.append(PorterStemmer().stem(j))
word_stem.append(word_stem_row)
6

In [32]:

word_stem

Out[32]:

[['good', 'need', 'updat', 'improv'],


['worst',
'mobil',
'bought',
'ever',
'batteri',
'drain',
'like',
'hell',
'backup',
'hour',
'internet',
'use',
'even',
'i',
'put',
'mobil',
'idl',

In [33]:

# Lemmatization (use instead of stemming)


nltk.download('wordnet')
word_lemma_row = []
word_lemma = []
for i in word_lower:
word_lemma_row = []
for j in i:
word_lemma_row.append(WordNetLemmatizer().lemmatize(j))
word_lemma.append(word_lemma_row)

[nltk_data] Downloading package wordnet to


[nltk_data] C:\Users\archa\AppData\Roaming\nltk_data...
[nltk_data] Unzipping corpora\wordnet.zip.
7

In [34]:

word_lemma

Out[34]:

[['good', 'need', 'update', 'improvement'],


['worst',
'mobile',
'bought',
'ever',
'battery',
'draining',
'like',
'hell',
'backup',
'hour',
'internet',
'us',
'even',
'i',
'put',
'mobile',
'idle',

In [37]:

# Apply pos tagging


nltk.download('averaged_perceptron_tagger')
word_tag_row = []
word_tag = []
for i in word_lemma:
word_tag.append(nltk.pos_tag(i))

[nltk_data] Downloading package averaged_perceptron_tagger to


[nltk_data] C:\Users\archa\AppData\Roaming\nltk_data...
[nltk_data] Package averaged_perceptron_tagger is already up-to-
[nltk_data] date!
8

In [38]:

word_tag

Out[38]:

[[('good', 'JJ'), ('need', 'NN'), ('update', 'JJ'), ('improvement', 'N


N')],
[('worst', 'RB'),
('mobile', 'NN'),
('bought', 'VBD'),
('ever', 'RB'),
('battery', 'RB'),
('draining', 'VBG'),
('like', 'IN'),
('hell', 'NN'),
('backup', 'IN'),
('hour', 'NN'),
('internet', 'NN'),
('us', 'PRP'),
('even', 'RB'),
('i', 'VBP'),
('put', 'VBP'),
('mobile', 'JJ'),

In [40]:

# for understanding the context, filter the nouns


filtered_tag = []
for i in word_tag:
filtered_row = []
for j in i:
if j[1].startswith('NN'):
filtered_row.append(j[0])
filtered_tag.append(filtered_row)

In [41]:

filtered_tag

Out[41]:

[['need', 'improvement'],
['mobile',
'hell',
'hour',
'internet',
'lie',
'amazon',
'lenove',
'battery',
'mah',
'booster',
'charger',
'hour',
'don',
'please',
'regret'],
['cash'],
[],
9

In [44]:

# Generating a word cloud


words = []
for i in filtered_tag:
for j in i:
words.append(j)

In [45]:

words

Out[45]:

['need',
'improvement',
'mobile',
'hell',
'hour',
'internet',
'lie',
'amazon',
'lenove',
'battery',
'mah',
'booster',
'charger',
'hour',
'don',
'please',
'regret',
'cash',

In [46]:

# Convert to string format


word_str = ','.join(words)
word_str

Out[46]:

'need,improvement,mobile,hell,hour,internet,lie,amazon,lenove,battery,mah,
booster,charger,hour,don,please,regret,cash,phone,everthey,phone,problem,p
hone,amazon,i,buyi,batterypoor,camerawaste,money,phone,awesome,heat,allot,
reason,hate,lenovo,k,note,battery,level,worn,problem,phone,problem,lenovo,
k,note,service,station,year,warranty,change,phone,lenovo,lot,glitch,dont,t
hing,option,wrost,phone,charger,damage,month,purchase,item,heating,batter
y,life,i,battery,problem,motherboard,problem,month,life,phone,slim,battry,
backup,screen,love,headset,time,product,range,specification,comparison,ran
ge,i,phone,amazon,seal,i,i,credit,card,i,r,deal,amazon,battery,i,solution,
battery,life,smartphone,galery,problem,atmos,speaker,phone,camera,speed,fe
ature,excelent,battery,product,product,camera,battery,phone,product,lenov
o,option,cast,screen,call,option,doesn,hotspot,phone,usb,cable,phone,pric
e,lenovo,display,specification,function,phone,i,fon,i,speekars,i,phone,iss
ue,color,screen,oreo,battery,heating,problem,phone,battery,update,oreo,si
m,customer,service,performance,battery,get,camera,backup,bestin,pricefull,
passa,wasole,phone,phone,performance,signal,restarts,phone,bcoms,plzz,don
t,buy,round,performance,r,day,trust,deal,amazon,disappointment,problem,hea
dache,problem,call,range,phone,rate,camera,quality,product,mobile,price,fe
10

In [63]:

wordcloud = WordCloud(background_color = 'black', max_words = 10000, contour_width = 3, con

In [64]:

wc = wordcloud.generate(word_str)

In [65]:

import matplotlib.pyplot as plt


plt.figure(figsize = (50,50))
wc.to_image()

Out[65]:

<Figure size 3600x3600 with 0 Axes>

In [66]:

# Change format for COuntVectorizer


str_list = []
for i in filtered_tag:
str_list.append(','.join(i))

10/12
11

In [67]:

str_list

Out[67]:

['need,improvement',
'mobile,hell,hour,internet,lie,amazon,lenove,battery,mah,booster,charger,
hour,don,please,regret',
'cash',
'',
'phone,everthey,phone,problem,phone,amazon',
'i,buyi,batterypoor,camerawaste,money',
'phone,awesome,heat,allot,reason,hate,lenovo,k,note',
'battery,level,worn',
'problem,phone,problem,lenovo,k,note,service,station,year,warranty,chang
e,phone,lenovo',
'lot,glitch,dont,thing,option',
'wrost',
'phone,charger,damage,month',
'purchase,item,heating,battery,life',
'i,battery,problem,motherboard,problem,month,life',
'phone,slim,battry,backup,screen,love',
'headset',

In [68]:

v1 = CountVectorizer().fit(str_list)

In [69]:

v1.get_feature_names()

Out[69]:

['aa',
'aab',
'aachha',
'aaj',
'aajata',
'aal',
'aap',
'aapka',
'aapki',
'aapko',
'aapne',
'aaps',
'aapse',
'aashiyana',
'aata',
'aate',
'aati',
'aavashyakta',

In [70]:

v2 = v1.transform(str_list)
12

In [71]:

v2.toarray()

Out[71]:

array([[0, 0, 0, ..., 0, 0, 0],


[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
...,
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [72]:

from sklearn.model_selection import train_test_split


x_tr, x_test, y_tr, y_test = train_test_split(v2, data['sentiment'])

In [73]:

# Applying NB
nb_mod = MultinomialNB().fit(x_tr,y_tr)

In [74]:

pred = nb_mod.predict(x_test)
pred

Out[74]:

array([1, 1, 0, ..., 1, 1, 0], dtype=int64)

In [75]:

from sklearn.metrics import accuracy_score


accuracy_score(y_test, pred)

Out[75]:

0.7176342327609703

In [ ]:
1

In [ ]:

# ARCHANA NAIR
#M.Sc. Computer Science (With Specialization in Data Science)
#University of Mumbai
#PRACTICAL 10 Implement Regression

In [13]:

import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import pylab

from sklearn.linear_model import LinearRegression


from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest,chi2,f_classif,f_regression
from sklearn.metrics import r2_score

In [14]:

df = pd.read_csv("C:/Users/archa/Downloads/Admission_Predict.csv")
df.head()

Out[14]:

Serial GRE TOEFL University Chance of


SOP LOR CGPA Research
No. Score Score Rating Admit

0 1 337 118 4 4.5 4.5 9.65 1 0.92

1 2 324 107 4 4.0 4.5 8.87 1 0.76

2 3 316 104 3 3.0 3.5 8.00 1 0.72

3 4 322 110 3 3.5 2.5 8.67 1 0.80

4 5 314 103 2 2.0 3.0 8.21 0 0.65

In [15]:

df.dtypes

Out[15]:

Serial No. int64


GRE Score int64
TOEFL Score int64
University Rating int64
SOP float64
LOR float64
CGPA float64
Research int64
Chance of Admit float64
dtype: object
2

In [16]:

df.drop('Serial No.',axis=1,inplace=True)

In [17]:

df.describe()

Out[17]:

TOEFL University C
GRE Score SOP LOR CGPA Research
Score Rating

count 400.000000 400.000000 400.000000 400.000000 400.000000 400.000000 400.000000 40

mean 316.807500 107.410000 3.087500 3.400000 3.452500 8.598925 0.547500

std 11.473646 6.069514 1.143728 1.006869 0.898478 0.596317 0.498362

min 290.000000 92.000000 1.000000 1.000000 1.000000 6.800000 0.000000

25% 308.000000 103.000000 2.000000 2.500000 3.000000 8.170000 0.000000

50% 317.000000 107.000000 3.000000 3.500000 3.500000 8.610000 1.000000

75% 325.000000 112.000000 4.000000 4.000000 4.000000 9.062500 1.000000

max 340.000000 120.000000 5.000000 5.000000 5.000000 9.920000 1.000000


3

In [18]:

plt.figure(figsize=(12,8))
sns.heatmap(df.corr(),annot=True,cmap='tab10')

Out[18]:

<AxesSubplot:>
4

In [19]:

sns.pairplot(df)

Out[19]:

<seaborn.axisgrid.PairGrid at 0x1412c3a4580>
5

In [20]:

#Multiple Linear regression


X = df.drop('Chance of Admit ',axis=1)
Y=df['Chance of Admit ']

In [21]:

x_tr,x_test,y_tr,y_test = train_test_split(X,Y,test_size=0.2,random_state=100)

In [22]:

lm1=LinearRegression().fit(x_tr,y_tr)

In [23]:

pred = lm1.predict(x_test)

In [24]:

acc=r2_score(y_test,pred)
acc

Out[24]:

0.7792013613144768

In [25]:

Accuracy_Table = pd.DataFrame({'Model_Name':['Multiple_Linear_Regression'],'Accuracy':[acc]

In [26]:

resid = np.array(pred)-np.array(y_test)
6

In [27]:

import scipy.stats as s
s.probplot(resid,dist="norm",plot=pylab)
pylab.show()

FEATURE SELECTION
In [30]:

X = df.drop('Chance of Admit ',axis=1)


Y=df['Chance of Admit '].values

In [31]:

X_scaled=StandardScaler().fit_transform(X)
7

In [32]:

X_scaled

Out[32]:

array([[ 1.76210664, 1.74697064, 0.79882862, ..., 1.16732114,


1.76481828, 0.90911166],
[ 0.62765641, -0.06763531, 0.79882862, ..., 1.16732114,
0.45515126, 0.90911166],
[-0.07046681, -0.56252785, -0.07660001, ..., 0.05293342,
-1.00563118, 0.90911166],
...,
[ 1.15124883, 1.41704229, 0.79882862, ..., 1.16732114,
1.42900622, 0.90911166],
[-0.41952842, -0.72749202, -0.07660001, ..., 0.61012728,
0.30403584, -1.09997489],
[ 1.41304503, 1.58200646, 0.79882862, ..., 0.61012728,
1.78160888, 0.90911166]])

In [61]:

fs_model=SelectKBest(f_regression,k=3).fit(X_scaled,Y)

In [62]:

fs_model.get_support(indices=True)

Out[62]:

array([0, 1, 5], dtype=int64)


8

In [63]:

X.iloc[:,[0,1,5]]

Out[63]:

GRE Score TOEFL Score CGPA

0 337 118 9.65

1 324 107 8.87

2 316 104 8.00

3 322 110 8.67

4 314 103 8.21

... ... ... ...

395 324 110 9.04

396 325 107 9.11

397 330 116 9.45

398 312 103 8.78

399 333 117 9.66

400 rows × 3 columns

PCA
In [28]:

pca_mod=PCA(n_components=0.95)

In [33]:

x_pca=pca_mod.fit_transform(X_scaled)

In [34]:

pca_mod.explained_variance_

Out[34]:

array([4.88523075, 0.72669508, 0.54806185, 0.31068105, 0.24582508])


9

In [35]:

import matplotlib.pyplot as plt


plt.plot(range(1,len(pca_mod.explained_variance_)+1),np.cumsum(pca_mod.explained_variance_)

Out[35]:

[<matplotlib.lines.Line2D at 0x14130040970>]

In [36]:

x_tr,x_test,y_tr,y_test = train_test_split(x_pca,Y,test_size=0.2,random_state=100)

In [37]:

lm3=LinearRegression().fit(x_tr,y_tr)

In [39]:

pred=lm3.predict(x_test)

In [40]:

acc=r2_score(pred,y_test)
acc

Out[40]:

0.7564181483461111
10

In [41]:

Accuracy_Table=Accuracy_Table.append({'Model_Name':'PCA','Accuracy':acc},ignore_index=True)

In [42]:

resid = np.array(pred)-np.array(y_test)

In [43]:

import scipy.stats as s
s.probplot(resid,dist="norm",plot=pylab)
pylab.show()

Lasso Linear Regression


In [45]:

from sklearn.linear_model import Lasso

In [46]:

x_tr,x_test,y_tr,y_test = train_test_split(X,Y,test_size=0.2,random_state=100)

In [47]:

Lasso_lm=Lasso(alpha=0.00001).fit(x_tr,y_tr)
11

In [48]:

pred=Lasso_lm.predict(x_test)

In [49]:

acc=r2_score(pred,y_test)
acc

Out[49]:

0.7626990495351637

In [50]:

resid = np.array(pred)-np.array(y_test)

In [51]:

import scipy.stats as s
s.probplot(resid,dist="norm",plot=pylab)
pylab.show()

In [52]:

Accuracy_Table=Accuracy_Table.append({'Model_Name':'Lasso_Regression','Accuracy':acc},ignor
12

In [53]:

Accuracy_Table

Out[53]:

Model_Name Accuracy

0 Multiple_Linear_Regression 0.779201

1 PCA 0.756418

2 Lasso_Regression 0.762699

Conclusion : Multiple Linear Regression gives the best


accuracy for the dataset
In [54]:

sns.regplot(x='GRE Score',y='Chance of Admit ',data=df)

Out[54]:

<AxesSubplot:xlabel='GRE Score', ylabel='Chance of Admit '>


13

In [55]:

sns.regplot(x='TOEFL Score',y='Chance of Admit ',data=df)

Out[55]:

<AxesSubplot:xlabel='TOEFL Score', ylabel='Chance of Admit '>


5/21/22, 4:39 PM 14

In [56]:

sns.regplot(x='SOP',y='Chance of Admit ',data=df)

Out[56]:

<AxesSubplot:xlabel='SOP', ylabel='Chance of Admit '>

In [57]:

sns.regplot(x='LOR ',y='Chance of Admit ',data=df)

Out[57]:

<AxesSubplot:xlabel='LOR ', ylabel='Chance of Admit '>


15

In [58]:

sns.regplot(x='CGPA',y='Chance of Admit ',data=df)

Out[58]:

<AxesSubplot:xlabel='CGPA', ylabel='Chance of Admit '>

In [59]:

sns.regplot(x='University Rating',y='Chance of Admit ',data=df)

Out[59]:

<AxesSubplot:xlabel='University Rating', ylabel='Chance of Admit '>


16

In [60]:

sns.regplot(x='Research',y='Chance of Admit ',data=df)

Out[60]:

<AxesSubplot:xlabel='Research', ylabel='Chance of Admit '>

In [ ]:

You might also like