Algorithms of Data Science Journal
UNIVERSITY OF MUMBAI
DEPARTMENT OF COMPUTER SCIENCE
Seat No.
UNIVERSITY OF MUMBAI
DEPARTMENT OF COMPUTER SCIENCE
CERTIFICATE
This is to certify that the work entered in this journal was done in the University
Department of Computer Science laboratory by
Mr./Ms. ARCHANA SUKUMARAN NAIR, Seat No.
for the course of M.Sc. Computer Science with Spl. in Data Science - Semester
II (CBCS) (Revised) during the academic year 2021-2022 in a satisfactory
manner.
External Examiner
Index
1 HADOOP HDFS
3 K-MEANS CLUSTERING
4 HIERARCHICAL CLUSTERING
5 KNN CLASSIFIER
6 KNN REGRESSOR
7 TIME SERIES
8 PREDICTING AUTHORS OF THE DISPUTED FEDERALIST PAPERS
9 SENTIMENT ANALYSIS
10 REGRESSION ANALYSIS
ARCHANA NAIR
#M.Sc. Computer Science (With Specialization in Data Science) #University of Mumbai #PRACTICAL 1 :- HADOOP HDFS
The hadoop version command reports which version of Hadoop is currently installed on the system.
The jps command lists the Java processes currently running for the Hadoop file system, such as the NameNode, DataNode and ResourceManager.
The appendToFile command appends the contents of a file on the local system to a file in the Hadoop file system.
The fs -ls -R / command recursively lists all the directories and sub-directories of the Hadoop file system.
The fs -du -h / command shows the size of each file in the Hadoop file system in a human-readable format.
Running start-yarn.cmd starts the YARN daemons (the ResourceManager and the NodeManagers).
ARCHANA NAIR
M.Sc. Computer Science (With Specialization in Data Science) University of Mumbai PRACTICAL 3 :- K-Means Clustering
In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.model_selection import GridSearchCV
import matplotlib.pyplot as plt
In [22]:
data = pd.read_csv("C:/Users/archa/Downloads/Mall_Customers.csv")
data.head()
Out[22]:
   CustomerID   Genre  Age  Annual Income (k$)  Spending Score (1-100)
0           1    Male   19                  15                      39
1           2    Male   21                  15                      81
2           3  Female   20                  16                       6
3           4  Female   23                  16                      77
4           5  Female   31                  17                      40
In [4]:
Out[4]:
    Genre  Age  Annual Income (k$)  Spending Score (1-100)
0    Male   19                  15                      39
1    Male   21                  15                      81
2  Female   20                  16                       6
3  Female   23                  16                      77
4  Female   31                  17                      40
In [23]:
data.dtypes
Out[23]:
CustomerID int64
Genre object
Age int64
Annual Income (k$) int64
Spending Score (1-100) int64
dtype: object
In [24]:
data.isna().sum()
Out[24]:
CustomerID 0
Genre 0
Age 0
Annual Income (k$) 0
Spending Score (1-100) 0
dtype: int64
In [5]:
sns.pairplot(data)
Out[5]:
<seaborn.axisgrid.PairGrid at 0x20172d70730>
In [6]:
data.columns
Out[6]:
In [7]:
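The code of In [7] and In [10] did not survive extraction. Given the values in Out[10] below and the dtypes shown in In [11], a minimal sketch of the likely preprocessing (the drop and the exact encoding are assumptions):
data = data.drop('CustomerID', axis=1)  # drop the identifier column
data['Genre'] = data['Genre'].map({'Male': 1, 'Female': 2})  # encode Genre as in Out[10]
data.head()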
In [10]:
Out[10]:
   Genre  Age  Annual Income (k$)  Spending Score (1-100)
0      1   19                  15                      39
1      1   21                  15                      81
2      2   20                  16                       6
3      2   23                  16                      77
4      2   31                  17                      40
In [11]:
data.dtypes
Out[11]:
Genre int64
Age int64
Annual Income (k$) int64
Spending Score (1-100) int64
dtype: object
In [26]:
#scaling transformation
#1. z-score normalization using StandardScaler (zero mean, unit variance)
#2. min-max normalization using MinMaxScaler (scales each feature to the 0-1 range)
In [12]:
df_customer = data.iloc[:,2:4]
df_customer.head()
Out[12]:
   Annual Income (k$)  Spending Score (1-100)
0                  15                      39
1                  15                      81
2                  16                       6
3                  16                      77
4                  17                      40
In [13]:
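The body of this cell was lost in extraction. A minimal sketch that would produce the scaled array in Out[13], assuming z-score scaling with StandardScaler (the name data_scaled is used in later cells):
from sklearn.preprocessing import StandardScaler  # assumption: not among the imports above
data_scaled = StandardScaler().fit_transform(df_customer)  # standardize income and spending score
data_scaled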
Out[13]:
array([[-1.73899919, -0.43480148],
[-1.73899919, 1.19570407],
[-1.70082976, -1.71591298],
[-1.70082976, 1.04041783],
[-1.66266033, -0.39597992],
[-1.66266033, 1.00159627],
[-1.62449091, -1.71591298],
[-1.62449091, 1.70038436],
[-1.58632148, -1.83237767],
[-1.58632148, 0.84631002],
[-1.58632148, -1.4053405 ],
[-1.58632148, 1.89449216],
[-1.54815205, -1.36651894],
[-1.54815205, 1.04041783],
[-1.54815205, -1.44416206],
[-1.54815205, 1.11806095],
[-1.50998262, -0.59008772],
[-1.50998262, 0.61338066],
In [14]:
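The body of this cell was lost. A sketch of the elbow-method loop that would produce the nine WCSS values shown in Out[15], assuming k runs from 2 to 10 as in the plot below:
wcss = []
for k in range(2, 11):
    km = KMeans(n_clusters=k, init='k-means++', random_state=0).fit(data_scaled)
    wcss.append(km.inertia_)  # within-cluster sum of squares for this k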
In [15]:
wcss
Out[15]:
[269.01679374906655,
157.70400815035939,
108.92131661364358,
65.56840815571681,
55.103778121150555,
44.86475569922555,
37.24321153347672,
33.85792110528426,
30.684270071530346]
In [16]:
#Plotting
plt.figure(figsize = (8,6), dpi=100)
plt.plot(range(2,11),wcss, marker = 'o', c='blue', markerfacecolor='red')
plt.xlabel('No of Clusters')
plt.ylabel('WCSS')
plt.show()
In [17]:
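The final model fit in this cell was lost. A sketch assuming five clusters, consistent with the elbow plot above and the name Kmodel_final used in In [18]:
Kmodel_final = KMeans(n_clusters=5, init='k-means++', random_state=0).fit(data_scaled)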
In [18]:
cl = Kmodel_final.predict(data_scaled)
In [19]:
cl
Out[19]:
array([0, 3, 0, 3, 0, 3, 0, 3, 0, 3, 0, 3, 0, 3, 0, 3, 0, 3, 0, 3, 0, 3,
0, 3, 0, 3, 0, 3, 0, 3, 0, 3, 0, 3, 0, 3, 0, 3, 0, 3, 0, 3, 0, 1,
0, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 4, 2, 1, 2, 4, 2, 4, 2,
1, 2, 4, 2, 4, 2, 4, 2, 4, 2, 1, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2,
4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2,
4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2,
4, 2])
In [20]:
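The body of this cell was lost. A sketch matching the head shown in Out[20], assuming the predicted labels are attached as a cluster column:
df_customer['cluster'] = cl  # attach the cluster label of each customer
df_customer.head()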
Out[20]:
   Annual Income (k$)  Spending Score (1-100)  cluster
0                  15                      39        0
1                  15                      81        3
2                  16                       6        0
3                  16                      77        3
4                  17                      40        0
In [27]:
# Visualization of clusters
plt.figure(figsize = (6,4), dpi = 100)
plt.scatter(x=df_customer['Annual Income (k$)'], y=df_customer['Spending Score (1-100)'], c=cl)
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score')
plt.show()
Cluster 1 = high income, low spender; Cluster 2 = high income, high spender; Cluster 3 = low income, high spender; Cluster 4 = low income, low spender; Cluster 5 = moderate income, moderate spender.
Conclusion
The mall customer data is clustered into 5 clusters. The green cluster contains people with a high spending score but a low annual income. The purple cluster contains people with a low annual income and a low spending score. The blue cluster contains people with an average annual income and an average spending score. The sea-green cluster contains people with a high annual income and a high spending score. The yellow cluster contains people with a high annual income but a low spending score.
ARCHANA NAIR
#M.Sc. Computer Science (With Specialization in Data Science) #University of Mumbai #PRACTICAL 4 :- Hierarchical Clustering
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.cluster.hierarchy import fcluster
from sklearn.cluster import AgglomerativeClustering
from scipy.spatial.distance import cdist
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
In [3]:
data = pd.read_csv('C:/Users/archa/Downloads/USArrests.csv')
data.head()
Out[3]:
In [4]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 50 non-null object
1 Murder 50 non-null float64
2 Assault 50 non-null int64
3 UrbanPop 50 non-null int64
4 Rape 50 non-null float64
dtypes: float64(2), int64(2), object(1)
memory usage: 2.1+ KB
In [6]:
data.isna().sum()
Out[6]:
Unnamed: 0 0
Murder 0
Assault 0
UrbanPop 0
Rape 0
dtype: int64
In [7]:
data.shape
Out[7]:
(50, 5)
In [8]:
data['Unnamed: 0'].value_counts()
Out[8]:
Georgia 1
Nevada 1
Maryland 1
Hawaii 1
North Dakota 1
Nebraska 1
Rhode Island 1
Missouri 1
Oregon 1
Virginia 1
North Carolina 1
New Hampshire 1
Indiana 1
Idaho 1
New Mexico 1
Florida 1
Oklahoma 1
Arizona 1
Delaware 1
New Jersey 1
Montana 1
Colorado 1
Illinois 1
Vermont 1
Tennessee 1
Arkansas 1
Kansas 1
Ohio 1
Massachusetts 1
South Dakota 1
Louisiana 1
Kentucky 1
Utah 1
Minnesota 1
Alabama 1
West Virginia 1
Washington 1
Pennsylvania 1
Wisconsin 1
Connecticut 1
Texas 1
Wyoming 1
Mississippi 1
California 1
Michigan 1
Iowa 1
South Carolina 1
New York 1
Maine 1
Alaska 1
Name: Unnamed: 0, dtype: int64
In [16]:
data.describe()
Out[16]:
In [17]:
Out[17]:
In [18]:
In [34]:
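The bodies of the cells above were lost in extraction. Since In [19] refers to a 'States' column, one of them presumably renamed the unnamed first column; a minimal sketch:
data = data.rename(columns={'Unnamed: 0': 'States'})  # assumption: give the state column a proper name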
In [19]:
# Converting the States column to numeric using get_dummies from pandas library
data_num = pd.get_dummies(data, columns = ['States'])
data_num.head()
Out[19]:
5 rows × 54 columns
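A cell between this output and In [23] was lost along with its figure. A dendrogram is the usual step before choosing k for agglomerative clustering; a minimal sketch using the scipy functions imported above:
Z = linkage(data_num, method='ward')  # Ward linkage, matching the AgglomerativeClustering call below
plt.figure(figsize=(10, 6))
dendrogram(Z)
plt.show()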
In [23]:
k = 3
h_cluster = AgglomerativeClustering(n_clusters = k, affinity = 'euclidean', linkage = 'ward')
h_cluster.fit(data_num)
Out[23]:
AgglomerativeClustering(n_clusters=3)
In [24]:
cluster = h_cluster.fit_predict(data_num)
In [26]:
# Silhouette Score
print('Silhouette Score: %0.3f' % metrics.silhouette_score(data_num,h_cluster.labels_))
K MEANS CLUSTERING
In [27]:
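The body of this cell was lost. A sketch of the elbow loop that would produce the seven WCSS values in Out[28], assuming k runs from 2 to 8 as in the plot below:
wcss = []
for k in range(2, 9):
    km = KMeans(n_clusters=k, init='k-means++').fit(data_num)
    wcss.append(km.inertia_)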
In [28]:
wcss
Out[28]:
[96447.02814449916,
48011.26535714287,
34774.629357142854,
29124.6065,
19001.82888888889,
16633.143809523808,
13960.15160714286]
In [29]:
# Elbow plot
plt.figure(figsize = (6,4), dpi = 100)
plt.plot(range(2,9), wcss, marker = 'o', c = 'blue')
plt.xlabel('No of clusters')
plt.ylabel('WCSS')
plt.show()
In [30]:
# Final model with 5 clusters
kmod_final = KMeans(n_clusters = 5, init = 'k-means++').fit(data_num)
cl = kmod_final.predict(data_num)
cl
Out[30]:
array([0, 0, 4, 3, 0, 3, 1, 0, 4, 3, 2, 1, 0, 1, 2, 1, 1, 0, 2, 4, 3, 0,
2, 0, 3, 1, 1, 0, 2, 3, 0, 0, 4, 2, 1, 3, 3, 1, 3, 0, 2, 3, 3, 1,
2, 3, 3, 2, 2, 3])
In [31]:
# Silhouette Score
from sklearn.metrics import silhouette_score
score = silhouette_score(data_num, kmod_final.labels_, metric='euclidean')
In [32]:
score
Out[32]:
0.4486735234754001
ARCHANA NAIR
#M.Sc. Computer Science (With Specialization in Data Science) #University of Mumbai #PRACTICAL 5 :- Implement KNN & PCA
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import GridSearchCV
In [3]:
titan_train = pd.read_csv("C:/Users/archa/Downloads/titan_train.csv")
titan_train.head()
Out[3]:
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500
In [5]:
titan_test = pd.read_csv("C:/Users/archa/Downloads/titan_test.csv")
titan_test.head()
Out[5]:
   PassengerId  Pclass                                          Name     Sex   Age  SibSp  Parch   Ticket     Fare Cabin
0          892       3                              Kelly, Mr. James    male  34.5      0      0   330911   7.8292   NaN
1          893       3              Wilkes, Mrs. James (Ellen Needs)  female  47.0      1      0   363272   7.0000   NaN
2          894       2                     Myles, Mr. Thomas Francis    male  62.0      0      0   240276   9.6875   NaN
3          895       3                              Wirz, Mr. Albert    male  27.0      0      0   315154   8.6625   NaN
4          896       3  Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female  22.0      1      1  3101298  12.2875   NaN
In [6]:
titan_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
In [7]:
y_train = titan_train['Survived']
y_train.value_counts()
Out[7]:
0 549
1 342
Name: Survived, dtype: int64
In [8]:
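The body of this cell was lost. 687 of the 891 Cabin values are missing and 687/891 ≈ 0.771, which matches Out[8], so a plausible reconstruction (an assumption) is the share of missing Cabin values, motivating dropping that column:
titan_train['Cabin'].isna().mean()  # fraction of rows with no Cabin value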
Out[8]:
0.7710437710437711
In [9]:
Out[9]:
In [10]:
# Dealing with missing values in the training set and test set
titan_train['Embarked'] = titan_train['Embarked'].fillna('S')
titan_train['Age'] = titan_train['Age'].fillna(titan_train['Age'].mean())
titan_test['Embarked'] = titan_test['Embarked'].fillna('S')
titan_test['Age'] = titan_test['Age'].fillna(titan_test['Age'].mean())
titan_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Survived 891 non-null int64
1 Pclass 891 non-null int64
2 Sex 891 non-null object
3 Age 891 non-null float64
4 SibSp 891 non-null int64
5 Parch 891 non-null int64
6 Fare 891 non-null float64
7 Embarked 891 non-null object
dtypes: float64(2), int64(4), object(2)
memory usage: 55.8+ KB
In [12]:
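The bodies of In [9]–In [12] were lost. Judging from the eight-column info above and the dummy columns in Out[12], they presumably dropped the identifier columns and one-hot encoded Sex and Embarked; a sketch (the column choices are assumptions):
titan_train = titan_train.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)
titan_train = pd.get_dummies(titan_train, columns=['Sex', 'Embarked'])
titan_train.head()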
Out[12]:
   Survived  Pclass   Age  SibSp  Parch     Fare  Sex_female  Sex_male  Embarked_C  Embarked_Q  Embarked_S
0         0       3  22.0      1      0   7.2500           0         1           0           0           1
1         1       1  38.0      1      0  71.2833           1         0           1           0           0
2         1       3  26.0      0      0   7.9250           1         0           0           0           1
3         1       1  35.0      1      0  53.1000           1         0           0           0           1
4         0       3  35.0      0      0   8.0500           0         1           0           0           1
In [13]:
titan_test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Pclass 418 non-null int64
1 Age 418 non-null float64
2 SibSp 418 non-null int64
3 Parch 418 non-null int64
4 Fare 417 non-null float64
5 Sex_female 418 non-null uint8
6 Sex_male 418 non-null uint8
7 Embarked_C 418 non-null uint8
8 Embarked_Q 418 non-null uint8
9 Embarked_S 418 non-null uint8
dtypes: float64(2), int64(3), uint8(5)
memory usage: 18.5 KB
In [14]:
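The body of this cell was lost. Its output matches In [13] except that Fare is now 418 non-null, so it presumably filled the one missing fare; a sketch:
titan_test['Fare'] = titan_test['Fare'].fillna(titan_test['Fare'].mean())  # assumption: mean imputation
titan_test.info()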
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Pclass 418 non-null int64
1 Age 418 non-null float64
2 SibSp 418 non-null int64
3 Parch 418 non-null int64
4 Fare 418 non-null float64
5 Sex_female 418 non-null uint8
6 Sex_male 418 non-null uint8
7 Embarked_C 418 non-null uint8
8 Embarked_Q 418 non-null uint8
9 Embarked_S 418 non-null uint8
dtypes: float64(2), int64(3), uint8(5)
memory usage: 18.5 KB
In [15]:
y_train.shape
Out[15]:
(891,)
In [17]:
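The body of this cell was lost. In [19] applies PCA to a scaled_titan_train matrix, so a sketch of the likely scaling step (dropping the target here is an assumption):
scaled_titan_train = StandardScaler().fit_transform(titan_train.drop('Survived', axis=1))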
In [19]:
# Applying PCA
pca_model = PCA(n_components = 0.95)
pca_scaled_titan = pca_model.fit_transform(scaled_titan_train)
pca_model.explained_variance_ratio_
Out[19]:
In [20]:
In [21]:
pca_model.n_components_
Out[21]:
In [23]:
Out[23]:
In [24]:
In [27]:
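The bodies of In [23]–In [27] were lost. The names used in later cells and the 268-row test set in the confusion matrix (30% of 891) suggest a split of the PCA-transformed data followed by a grid search; a sketch (names and parameters are assumptions guided by those cells):
titan_train, x_test, val_train, val_test = train_test_split(pca_scaled_titan, y_train, test_size=0.3)  # note: reuses the name titan_train for the training features
gscv_model = GridSearchCV(KNeighborsClassifier(), param_grid={'n_neighbors': range(1, 10)})
gscv_model.fit(titan_train, val_train)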
Out[27]:
GridSearchCV(estimator=KNeighborsClassifier(),
param_grid={'n_neighbors': range(1, 10)})
In [28]:
# best value of k
gscv_model.best_params_
Out[28]:
{'n_neighbors': 1}
In [29]:
# Applying KNN
knn_model = KNeighborsClassifier(n_neighbors = 1).fit(titan_train, val_train)
pred = knn_model.predict(x_test)
In [30]:
# confusion matrix
cm = confusion_matrix(val_test, pred)
print(cm)
[[165 1]
[ 0 102]]
In [31]:
# Classification report
print(classification_report(val_test, pred))
In [32]:
Out[32]:
(418, 10)
In [33]:
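The body of this cell was lost. In [34] uses scaled_titan_test and pca_mod, so it presumably scaled the test features and created a fresh PCA model; a sketch:
scaled_titan_test = StandardScaler().fit_transform(titan_test)
pca_mod = PCA(n_components = 0.95)  # keep components explaining 95% of the variance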
In [34]:
pca_test = pca_mod.fit_transform(scaled_titan_test)
In [35]:
pca_mod.explained_variance_ratio_
Out[35]:
In [36]:
plt.figure(figsize = (8,7))
plt.plot(np.cumsum(pca_mod.explained_variance_ratio_))
plt.xlabel('Number of Components')
plt.ylabel('Explained Variance')
plt.show()
In [37]:
pca_mod.n_components_
Out[37]:
8
In [39]:
pred2 = knn_model.predict(pca_test)
pred2
Out[39]:
array([0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0,
1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,
1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1,
1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1,
0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1,
1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0,
0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0,
1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1,
1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,
1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0,
1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0,
1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0,
0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0,
1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,
0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0],
dtype=int64)
ARCHANA NAIR
#M.Sc. Computer Science (With Specialization in Data Science) #University of Mumbai #PRACTICAL 6 :- Implement KNN Regressor
In [2]:
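The imports and data-loading code for this practical were lost in extraction. A sketch consistent with the names used in the cells below (the file path is illustrative):
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error as mse

df_data = pd.read_csv('Admission_Predict.csv')  # hypothetical path; Out[3] shows a 400 x 9 frame
df_data.head()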
Out[2]:
In [3]:
df_data.shape
Out[3]:
(400, 9)
In [8]:
df_data.columns
Out[8]:
In [6]:
df_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Serial No. 400 non-null int64
1 GRE Score 400 non-null int64
2 TOEFL Score 400 non-null int64
3 University Rating 400 non-null int64
4 SOP 400 non-null float64
5 LOR 400 non-null float64
6 CGPA 400 non-null float64
7 Research 400 non-null int64
8 Chance of Admit 400 non-null float64
dtypes: float64(4), int64(5)
memory usage: 28.2 KB
In [10]:
In [12]:
In [16]:
Out[16]:
In [18]:
Out[18]:
In [19]:
In [20]:
Out[20]:
In [21]:
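The bodies of In [19]–In [21] were lost. In [22] predicts with lr_model on x_test, so they presumably split the data and fit a linear regression; a sketch (the feature and target names are assumptions):
x_tr, x_test, y_tr, y_test = train_test_split(x, y, test_size=0.2)  # hypothetical feature matrix x and target y
lr_model = LinearRegression().fit(x_tr, y_tr)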
In [22]:
pred_1 = lr_model.predict(x_test)
pred_1
Out[22]:
In [23]:
Out[23]:
0.7693406066446685
In [24]:
Out[24]:
0.0674866003456461
KNN REGRESSION
In [25]:
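The body of this cell was lost. Out[26] shows the fitted search object, so a sketch of the grid search it implies:
gscv = GridSearchCV(KNeighborsRegressor(), param_grid={'n_neighbors': range(1, 20)}, cv=5)
gscv.fit(x_tr, y_tr)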
In [26]:
Out[26]:
GridSearchCV(cv=5, estimator=KNeighborsRegressor(),
param_grid={'n_neighbors': range(1, 20)})
In [27]:
Out[27]:
{'n_neighbors': 16}
In [28]:
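The body of this cell was lost. Given the best k of 16 from Out[27] and the name pred_knn used below, a sketch:
knn_reg = KNeighborsRegressor(n_neighbors=16).fit(x_tr, y_tr)  # hypothetical model name
pred_knn = knn_reg.predict(x_test)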
In [29]:
# r2 score
r2_score(y_test, pred_knn)
Out[29]:
0.7484011160531958
In [30]:
# Checking rmse
mse(y_test, pred_knn, squared = False)
Out[30]:
0.07048331688551321
ARCHANA NAIR
#M.Sc. Computer Science (With Specialization in Data Science) #University of Mumbai #PRACTICAL 7 :- Time Series (in R)
# Derive year, month and day columns from the Date field
data$Date = as.Date(data$Date)
data['year'] = strftime(data$Date, '%Y')
data['mon'] = strftime(data$Date, '%b')
data['day'] = strftime(data$Date, '%d')
data
# Plot the trend, random and seasonal components returned by decompose()
plot(decomp_data$trend)
plot(decomp_data$random)
plot(decomp_data$seasonal)
# Month-wise boxplot of the series
boxplot(data ~ cycle(data))
# Apply ARIMA (auto.arima from the forecast package)
arima_mod = auto.arima(stationary_2)
f_am = forecast(arima_mod, h = 5, level = c(95))
f_am
plot(f_am)
ARCHANA NAIR
#M.Sc. Computer Science (With Specialization in Data Science) #University of Mumbai #PRACTICAL 8 :- Predicting the authors of the disputed Federalist Papers
In [41]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from sklearn import tree
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
In [42]:
data = pd.read_csv('C:/Users/archa/Downloads/Disputed_Essay_data.csv')
data.head()
Out[42]:
0 dispt dispt_fed_49.txt 0.280 0.052 0.009 0.096 0.358 0.026 0.131 0.122 ... 0.009 0.0
1 dispt dispt_fed_50.txt 0.177 0.063 0.013 0.038 0.393 0.063 0.051 0.139 ... 0.051 0.0
2 dispt dispt_fed_51.txt 0.339 0.090 0.008 0.030 0.301 0.008 0.068 0.203 ... 0.008 0.0
3 dispt dispt_fed_52.txt 0.270 0.024 0.016 0.024 0.262 0.056 0.064 0.111 ... 0.087 0.0
4 dispt dispt_fed_53.txt 0.303 0.054 0.027 0.034 0.404 0.040 0.128 0.148 ... 0.027 0.0
5 rows × 72 columns
In [43]:
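The body of this cell was lost. The heads below show that the training set holds the attributed essays and the test set the disputed ones; a sketch:
data_tr = data[data['author'] != 'dispt']   # essays with a known author
data_test = data[data['author'] == 'dispt'] # the disputed essays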
In [44]:
data_tr.head()
Out[44]:
11 Hamilton Hamilton_fed_1.txt 0.213 0.083 0.000 0.083 0.343 0.056 0.111 0.093 ... 0.0
12 Hamilton Hamilton_fed_11.txt 0.369 0.070 0.006 0.076 0.411 0.023 0.053 0.117 ... 0.0
13 Hamilton Hamilton_fed_12.txt 0.305 0.047 0.007 0.068 0.386 0.047 0.102 0.108 ... 0.0
14 Hamilton Hamilton_fed_13.txt 0.391 0.045 0.015 0.030 0.270 0.045 0.060 0.090 ... 0.0
15 Hamilton Hamilton_fed_15.txt 0.327 0.096 0.000 0.086 0.356 0.014 0.086 0.072 ... 0.0
5 rows × 72 columns
In [45]:
data_test.head()
Out[45]:
0 dispt dispt_fed_49.txt 0.280 0.052 0.009 0.096 0.358 0.026 0.131 0.122 ... 0.009 0.0
1 dispt dispt_fed_50.txt 0.177 0.063 0.013 0.038 0.393 0.063 0.051 0.139 ... 0.051 0.0
2 dispt dispt_fed_51.txt 0.339 0.090 0.008 0.030 0.301 0.008 0.068 0.203 ... 0.008 0.0
3 dispt dispt_fed_52.txt 0.270 0.024 0.016 0.024 0.262 0.056 0.064 0.111 ... 0.087 0.0
4 dispt dispt_fed_53.txt 0.303 0.054 0.027 0.034 0.404 0.040 0.128 0.148 ... 0.027 0.0
5 rows × 72 columns
In [46]:
Out[46]:
author 0
filename 0
a 0
all 0
also 0
..
who 0
will 0
with 0
would 0
your 0
Length: 72, dtype: int64
In [47]:
data_tr['author'].value_counts()
Out[47]:
Hamilton 51
Madison 15
Jay 5
HM 3
Name: author, dtype: int64
In [48]:
Out[48]:
11 0.213 0.083 0.000 0.083 0.343 0.056 0.111 0.093 0.065 0.315 ... 0.000 0.000 0.00
12 0.369 0.070 0.006 0.076 0.411 0.023 0.053 0.117 0.065 0.258 ... 0.000 0.012 0.01
13 0.305 0.047 0.007 0.068 0.386 0.047 0.102 0.108 0.088 0.271 ... 0.000 0.000 0.00
14 0.391 0.045 0.015 0.030 0.270 0.045 0.060 0.090 0.015 0.376 ... 0.000 0.000 0.00
15 0.327 0.096 0.000 0.086 0.356 0.014 0.086 0.072 0.115 0.211 ... 0.014 0.038 0.01
5 rows × 70 columns
In [49]:
In [50]:
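The bodies of In [49] and In [50] were lost. Later cells use x, x_test, y_test and dtree, and the classification report shows a 15-row test set (20% of the 74 attributed essays); a sketch:
x = data_tr.drop(['author', 'filename'], axis=1)  # the 70 function-word columns (cf. Out[48])
y = data_tr['author']
x_tr, x_test, y_tr, y_test = train_test_split(x, y, test_size=0.2)
dtree = DecisionTreeClassifier().fit(x_tr, y_tr)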
In [51]:
pred_dtree = dtree.predict(x_test)
In [52]:
# Classification Report
print(classification_report(y_test, pred_dtree))
accuracy 0.87 15
macro avg 0.55 0.58 0.57 15
weighted avg 0.81 0.87 0.83 15
C:\Users\verma\anaconda3\lib\site-packages\sklearn\metrics\_classification.py:1245: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
In [53]:
# Confusion matrix
cm = confusion_matrix(y_test, pred_dtree)
cm
Out[53]:
array([[ 0, 0, 1],
[ 0, 10, 0],
[ 0, 1, 3]], dtype=int64)
In [54]:
# Accuracy Score
accuracy_score(y_test, pred_dtree)
Out[54]:
0.8666666666666667
In [55]:
tree.export_text(dtree)
Out[55]:
|--- feature_59 <= 0.01
|   |--- feature_64 <= 0.11
|   |   |--- feature_4 <= 0.57
|   |   |   |--- feature_33 <= 0.02
|   |   |   |   |--- class: HM
|   |   |   |--- feature_33 > 0.02
|   |   |   |   |--- class: Hamilton
|   |   |--- feature_4 > 0.57
|   |   |   |--- class: Jay
|   |--- feature_64 > 0.11
|   |   |--- feature_38 <= 0.74
|   |   |   |--- class: HM
|   |   |--- feature_38 > 0.74
|   |   |   |--- class: Madison
|--- feature_59 > 0.01
|   |--- class: Hamilton
In [56]:
# Visualizing the fitted decision tree
plt.figure(figsize = (20,20))
tree.plot_tree(dtree, feature_names = x.columns, class_names = data['author'].value_counts().index, filled = True)
plt.show()
In [57]:
In [58]:
pred_test = dtree.predict(data_test)
pred_test
Out[58]:
In [59]:
Out[59]:
0 0.280 0.052 0.009 0.096 0.358 0.026 0.131 0.122 0.017 0.411 ... 0.017 0.000 0.009
1 0.177 0.063 0.013 0.038 0.393 0.063 0.051 0.139 0.114 0.393 ... 0.000 0.000 0.000
2 0.339 0.090 0.008 0.030 0.301 0.008 0.068 0.203 0.023 0.474 ... 0.015 0.008 0.000
3 0.270 0.024 0.016 0.024 0.262 0.056 0.064 0.111 0.056 0.365 ... 0.079 0.008 0.024
4 0.303 0.054 0.027 0.034 0.404 0.040 0.128 0.148 0.013 0.344 ... 0.020 0.020 0.007
5 rows × 71 columns
In [60]:
In [61]:
In [62]:
Out[62]:
In [63]:
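The body of this cell was lost. Out[64] shows the fitted object, so a sketch of the grid search it implies:
gscv_mod = GridSearchCV(KNeighborsClassifier(), param_grid={'n_neighbors': range(1, 50)}, cv=10)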
In [64]:
gscv_mod.fit(x_tr, y_tr)
C:\Users\verma\anaconda3\lib\site-packages\sklearn\model_selection\_split.py:666: UserWarning: The least populated class in y has only 2 members, which is less than n_splits=10.
warnings.warn(("The least populated class in y has only %d"
Out[64]:
GridSearchCV(cv=10, estimator=KNeighborsClassifier(),
param_grid={'n_neighbors': range(1, 50)})
In [65]:
# Best k
gscv_mod.best_params_
Out[65]:
{'n_neighbors': 10}
In [66]:
# Applying knn
knn_mod = KNeighborsClassifier(n_neighbors = 7).fit(x_tr, y_tr)
In [67]:
# Predicting
pred_knn = knn_mod.predict(x_test)
pred_knn
Out[67]:
In [68]:
Out[68]:
0 0.280 0.052 0.009 0.096 0.358 0.026 0.131 0.122 0.017 0.411 ... 0.009 0.017 0.00
1 0.177 0.063 0.013 0.038 0.393 0.063 0.051 0.139 0.114 0.393 ... 0.051 0.000 0.00
2 0.339 0.090 0.008 0.030 0.301 0.008 0.068 0.203 0.023 0.474 ... 0.008 0.015 0.00
3 0.270 0.024 0.016 0.024 0.262 0.056 0.064 0.111 0.056 0.365 ... 0.087 0.079 0.00
4 0.303 0.054 0.027 0.034 0.404 0.040 0.128 0.148 0.013 0.344 ... 0.027 0.020 0.02
5 0.245 0.059 0.007 0.067 0.282 0.052 0.111 0.252 0.015 0.297 ... 0.007 0.030 0.01
6 0.349 0.036 0.007 0.029 0.335 0.058 0.087 0.073 0.116 0.378 ... 0.015 0.029 0.01
7 0.414 0.083 0.009 0.018 0.478 0.046 0.110 0.074 0.037 0.331 ... 0.018 0.009 0.00
8 0.248 0.040 0.007 0.040 0.356 0.034 0.154 0.161 0.047 0.289 ... 0.027 0.007 0.02
9 0.442 0.062 0.006 0.075 0.423 0.037 0.093 0.100 0.031 0.379 ... 0.000 0.000 0.02
10 0.276 0.048 0.015 0.082 0.324 0.044 0.058 0.135 0.048 0.290 ... 0.044 0.024 0.00
11 rows × 70 columns
In [69]:
In [70]:
data_test2['Pred_Author'] = pred_test
data_test2.head()
Out[70]:
0 0.280 0.052 0.009 0.096 0.358 0.026 0.131 0.122 0.017 0.411 ... 0.017 0.000 0.009
1 0.177 0.063 0.013 0.038 0.393 0.063 0.051 0.139 0.114 0.393 ... 0.000 0.000 0.000
2 0.339 0.090 0.008 0.030 0.301 0.008 0.068 0.203 0.023 0.474 ... 0.015 0.008 0.000
3 0.270 0.024 0.016 0.024 0.262 0.056 0.064 0.111 0.056 0.365 ... 0.079 0.008 0.024
4 0.303 0.054 0.027 0.034 0.404 0.040 0.128 0.148 0.013 0.344 ... 0.020 0.020 0.007
5 rows × 71 columns
In [71]:
# Confusion matrix
print(confusion_matrix(y_test, pred_knn))
[[ 0 1 0]
[ 0 10 0]
[ 0 4 0]]
In [72]:
accuracy_score(y_test, pred_knn)
Out[72]:
0.6666666666666666
The model created using the Decision Tree classifier gives an accuracy score of 87%, whereas the model created using K-Nearest Neighbors gives an accuracy score of 67%.
Therefore, from the accuracy scores we can see that the decision tree performs better in this case.
ARCHANA NAIR
#M.Sc. Computer Science (With Specialization in Data Science) #University of Mumbai #PRACTICAL 9 :- Sentiment Analysis
In [8]:
import nltk
from wordcloud import WordCloud
from nltk.tokenize import word_tokenize
from nltk.tokenize import RegexpTokenizer
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
import pandas as pd
import numpy as np
In [10]:
Out[10]:
   sentiment review
3          1   Good
In [11]:
# Data preprocessing
import string
string.punctuation
Out[11]:
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
In [13]:
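The bodies of In [13]–In [16] were lost. Given the RegexpTokenizer import and the word_token list displayed below, a sketch (the DataFrame and column names are assumptions):
tokenizer = RegexpTokenizer(r'\w+')  # keep word characters only, dropping punctuation
word_token = [tokenizer.tokenize(str(review)) for review in data['review']]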
In [14]:
words
Out[14]:
In [15]:
In [16]:
word_token
Out[16]:
In [17]:
len(word_token)
Out[17]:
14675
In [20]:
nltk.download('stopwords')
Out[20]:
True
In [21]:
Out[21]:
['i',
'me',
'my',
'myself',
'we',
'our',
'ours',
'ourselves',
'you',
"you're",
"you've",
"you'll",
"you'd",
'your',
'yours',
'yourself',
'yourselves',
'he',
In [25]:
# Remove stopwords from each tokenized review
stop_words = set(stopwords.words('english'))  # 'english' is the corpus name; hoisting the set avoids re-reading it per token
word_clean = []
for i in word_token:
    word_clean_row = []
    for j in i:
        if j not in stop_words:
            word_clean_row.append(j)
    word_clean.append(word_clean_row)
In [26]:
word_clean
Out[26]:
In [29]:
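The body of this cell was lost. The name word_lower displayed below suggests lowercasing; a sketch:
word_lower = [[w.lower() for w in row] for row in word_clean]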
In [30]:
word_lower
Out[30]:
In [31]:
# stemming
stemmer = PorterStemmer()  # create the stemmer once instead of per token
word_stem = []
for i in word_lower:
    word_stem_row = []
    for j in i:
        word_stem_row.append(stemmer.stem(j))
    word_stem.append(word_stem_row)
In [32]:
word_stem
Out[32]:
In [33]:
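The body of this cell was lost. The name word_lemma below suggests lemmatization; a sketch (the input list is an assumption):
lemmatizer = WordNetLemmatizer()
word_lemma = [[lemmatizer.lemmatize(w) for w in row] for row in word_stem]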
In [34]:
word_lemma
Out[34]:
In [37]:
In [38]:
word_tag
Out[38]:
In [40]:
In [41]:
filtered_tag
Out[41]:
[['need', 'improvement'],
['mobile',
'hell',
'hour',
'internet',
'lie',
'amazon',
'lenove',
'battery',
'mah',
'booster',
'charger',
'hour',
'don',
'please',
'regret'],
['cash'],
[],
In [44]:
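The body of this cell was lost. Out[45] shows one flat word list, so it presumably flattened the per-review lists; a sketch:
words = [w for row in filtered_tag for w in row]  # flatten per-review lists into one list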
In [45]:
words
Out[45]:
['need',
'improvement',
'mobile',
'hell',
'hour',
'internet',
'lie',
'amazon',
'lenove',
'battery',
'mah',
'booster',
'charger',
'hour',
'don',
'please',
'regret',
'cash',
In [46]:
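The body of this cell was lost. Out[46] shows a single comma-joined string; a sketch:
word_str = ','.join(words)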
Out[46]:
'need,improvement,mobile,hell,hour,internet,lie,amazon,lenove,battery,mah,
booster,charger,hour,don,please,regret,cash,phone,everthey,phone,problem,p
hone,amazon,i,buyi,batterypoor,camerawaste,money,phone,awesome,heat,allot,
reason,hate,lenovo,k,note,battery,level,worn,problem,phone,problem,lenovo,
k,note,service,station,year,warranty,change,phone,lenovo,lot,glitch,dont,t
hing,option,wrost,phone,charger,damage,month,purchase,item,heating,batter
y,life,i,battery,problem,motherboard,problem,month,life,phone,slim,battry,
backup,screen,love,headset,time,product,range,specification,comparison,ran
ge,i,phone,amazon,seal,i,i,credit,card,i,r,deal,amazon,battery,i,solution,
battery,life,smartphone,galery,problem,atmos,speaker,phone,camera,speed,fe
ature,excelent,battery,product,product,camera,battery,phone,product,lenov
o,option,cast,screen,call,option,doesn,hotspot,phone,usb,cable,phone,pric
e,lenovo,display,specification,function,phone,i,fon,i,speekars,i,phone,iss
ue,color,screen,oreo,battery,heating,problem,phone,battery,update,oreo,si
m,customer,service,performance,battery,get,camera,backup,bestin,pricefull,
passa,wasole,phone,phone,performance,signal,restarts,phone,bcoms,plzz,don
t,buy,round,performance,r,day,trust,deal,amazon,disappointment,problem,hea
dache,problem,call,range,phone,rate,camera,quality,product,mobile,price,fe
In [63]:
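The body of this cell was lost. In [64] calls wordcloud.generate(word_str), so it presumably constructed a WordCloud object; a sketch (the parameters are assumptions):
import matplotlib.pyplot as plt  # assumption: needed to render the cloud in In [65]
wordcloud = WordCloud(background_color='white')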
In [64]:
wc = wordcloud.generate(word_str)
In [65]:
Out[65]:
In [66]:
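The bodies of In [66]–In [67] were lost. Out[67] shows one comma-joined string per review; a sketch:
str_list = [','.join(row) for row in filtered_tag]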
In [67]:
str_list
Out[67]:
['need,improvement',
'mobile,hell,hour,internet,lie,amazon,lenove,battery,mah,booster,charger,
hour,don,please,regret',
'cash',
'',
'phone,everthey,phone,problem,phone,amazon',
'i,buyi,batterypoor,camerawaste,money',
'phone,awesome,heat,allot,reason,hate,lenovo,k,note',
'battery,level,worn',
'problem,phone,problem,lenovo,k,note,service,station,year,warranty,chang
e,phone,lenovo',
'lot,glitch,dont,thing,option',
'wrost',
'phone,charger,damage,month',
'purchase,item,heating,battery,life',
'i,battery,problem,motherboard,problem,month,life',
'phone,slim,battry,backup,screen,love',
'headset',
In [68]:
v1 = CountVectorizer().fit(str_list)
In [69]:
v1.get_feature_names()
Out[69]:
['aa',
'aab',
'aachha',
'aaj',
'aajata',
'aal',
'aap',
'aapka',
'aapki',
'aapko',
'aapne',
'aaps',
'aapse',
'aashiyana',
'aata',
'aate',
'aati',
'aavashyakta',
In [70]:
v2 = v1.transform(str_list)
In [71]:
v2.toarray()
Out[71]:
In [72]:
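The body of this cell was lost. In [73] fits on x_tr and y_tr, so it presumably split the vectorized reviews against the sentiment labels; a sketch (the split parameters are assumptions):
from sklearn.model_selection import train_test_split  # assumption: not among the imports above
x_tr, x_test, y_tr, y_test = train_test_split(v2, data['sentiment'], test_size=0.2)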
In [73]:
# Applying NB
nb_mod = MultinomialNB().fit(x_tr,y_tr)
In [74]:
pred = nb_mod.predict(x_test)
pred
Out[74]:
In [75]:
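The body of this cell was lost. Out[75] is a single score, presumably the test accuracy of the Naive Bayes model; a sketch:
from sklearn.metrics import accuracy_score  # assumption
accuracy_score(y_test, pred)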
Out[75]:
0.7176342327609703
ARCHANA NAIR
#M.Sc. Computer Science (With Specialization in Data Science) #University of Mumbai #PRACTICAL 10 :- Implement Regression
In [13]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import pylab
In [14]:
df = pd.read_csv("C:/Users/archa/Downloads/Admission_Predict.csv")
df.head()
Out[14]:
In [15]:
df.dtypes
Out[15]:
In [16]:
df.drop('Serial No.',axis=1,inplace=True)
In [17]:
df.describe()
Out[17]:
(summary statistics for GRE Score, TOEFL Score, University Rating, SOP, LOR, CGPA, Research and Chance of Admit)
In [18]:
plt.figure(figsize=(12,8))
sns.heatmap(df.corr(),annot=True,cmap='tab10')
Out[18]:
<AxesSubplot:>
In [19]:
sns.pairplot(df)
Out[19]:
<seaborn.axisgrid.PairGrid at 0x1412c3a4580>
In [20]:
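The body of this cell was lost. The cells below use X, Y and several unimported sklearn names, so a sketch of the likely setup (the exact target column name in the Admission_Predict file is an assumption):
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.metrics import r2_score
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.decomposition import PCA

X = df.drop('Chance of Admit ', axis=1)  # hypothetical column name (some versions of this file have a trailing space)
Y = df['Chance of Admit ']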
In [21]:
x_tr,x_test,y_tr,y_test = train_test_split(X,Y,test_size=0.2,random_state=100)
In [22]:
lm1=LinearRegression().fit(x_tr,y_tr)
In [23]:
pred = lm1.predict(x_test)
In [24]:
acc=r2_score(y_test,pred)
acc
Out[24]:
0.7792013613144768
In [25]:
Accuracy_Table = pd.DataFrame({'Model_Name':['Multiple_Linear_Regression'],'Accuracy':[acc]})
In [26]:
resid = np.array(pred)-np.array(y_test)
In [27]:
import scipy.stats as s
s.probplot(resid,dist="norm",plot=pylab)
pylab.show()
FEATURE SELECTION
In [30]:
In [31]:
X_scaled=StandardScaler().fit_transform(X)
In [32]:
X_scaled
Out[32]:
In [61]:
fs_model=SelectKBest(f_regression,k=3).fit(X_scaled,Y)
In [62]:
fs_model.get_support(indices=True)
Out[62]:
In [63]:
X.iloc[:,[0,1,5]]
Out[63]:
PCA
In [28]:
pca_mod=PCA(n_components=0.95)
In [33]:
x_pca=pca_mod.fit_transform(X_scaled)
In [34]:
pca_mod.explained_variance_
Out[34]:
In [35]:
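The body of this cell was lost; Out[35] shows a single line object, consistent with a cumulative explained-variance plot:
plt.plot(np.cumsum(pca_mod.explained_variance_ratio_))  # scree-style curve for choosing components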
Out[35]:
[<matplotlib.lines.Line2D at 0x14130040970>]
In [36]:
x_tr,x_test,y_tr,y_test = train_test_split(x_pca,Y,test_size=0.2,random_state=100)
In [37]:
lm3=LinearRegression().fit(x_tr,y_tr)
In [39]:
pred=lm3.predict(x_test)
In [40]:
acc = r2_score(pred, y_test)  # note: r2_score expects (y_true, y_pred); the swapped order here affects the value
acc
Out[40]:
0.7564181483461111
In [41]:
Accuracy_Table=Accuracy_Table.append({'Model_Name':'PCA','Accuracy':acc},ignore_index=True)
In [42]:
resid = np.array(pred)-np.array(y_test)
In [43]:
import scipy.stats as s
s.probplot(resid,dist="norm",plot=pylab)
pylab.show()
In [46]:
x_tr,x_test,y_tr,y_test = train_test_split(X,Y,test_size=0.2,random_state=100)
In [47]:
Lasso_lm=Lasso(alpha=0.00001).fit(x_tr,y_tr)
In [48]:
pred=Lasso_lm.predict(x_test)
In [49]:
acc = r2_score(pred, y_test)  # note: arguments are swapped here as well
acc
Out[49]:
0.7626990495351637
In [50]:
resid = np.array(pred)-np.array(y_test)
In [51]:
import scipy.stats as s
s.probplot(resid,dist="norm",plot=pylab)
pylab.show()
In [52]:
Accuracy_Table = Accuracy_Table.append({'Model_Name':'Lasso_Regression','Accuracy':acc}, ignore_index=True)
In [53]:
Accuracy_Table
Out[53]:
Model_Name Accuracy
0 Multiple_Linear_Regression 0.779201
1 PCA 0.756418
2 Lasso_Regression 0.762699