Algorithms of Data Science Journal
UNIVERSITY OF MUMBAI
DEPARTMENT OF COMPUTER SCIENCE
Seat No.
UNIVERSITY OF MUMBAI
DEPARTMENT OF COMPUTER SCIENCE
CERTIFICATE
This is to certify that the work entered in this journal was done in the University
Department of Computer Science laboratory by
Mr./Ms. ARCHANA SUKUMARAN NAIR, Seat No.
for the course of M.Sc. Computer Science with Spl. in Data Science - Semester
II (CBCS) (Revised) during the academic year 2021-2022 in a satisfactory
manner.
External Examiner
Index
1 HADOOP HDFS
3 K-MEANS CLUSTERING
4 HIERARCHICAL CLUSTERING
5 KNN CLASSIFIER
6 KNN REGRESSOR
7 TIME SERIES
8 PREDICTING AUTHORS OF THE DISPUTED FEDERALIST PAPERS
9 SENTIMENT ANALYSIS
10 REGRESSION ANALYSIS
ARCHANA NAIR
#M.Sc. Computer Science (With Specialization in Data Science) #University of Mumbai #PRACTICAL 1 :- HADOOP HDFS
The hadoop version command reports which version of Hadoop is currently installed on the system.
The jps command lists the Java processes currently running for the Hadoop file system, such as the NameNode, DataNode and ResourceManager.
The appendToFile command appends the contents of a file on the local system to a file in the Hadoop file system.
The fs -ls -R / command recursively lists all the directories and sub-directories of the Hadoop file system.
The fs -du -h / command shows the size of each file in the Hadoop file system in a human-readable format.
Running start-yarn.cmd starts the YARN daemons (the ResourceManager and the NodeManagers).
ARCHANA NAIR
M.Sc. Computer Science (With Specialization in Data Science) University of Mumbai PRACTICAL 3 :- K-Means Clustering
In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.model_selection import GridSearchCV
import matplotlib.pyplot as plt
In [22]:
data = pd.read_csv("C:/Users/archa/Downloads/Mall_Customers.csv")
data.head()
Out[22]:
   CustomerID   Genre  Age  Annual Income (k$)  Spending Score (1-100)
0           1    Male   19                  15                      39
1           2    Male   21                  15                      81
2           3  Female   20                  16                       6
3           4  Female   23                  16                      77
4           5  Female   31                  17                      40
In [4]:
Out[4]:
    Genre  Age  Annual Income (k$)  Spending Score (1-100)
0    Male   19                  15                      39
1    Male   21                  15                      81
2  Female   20                  16                       6
3  Female   23                  16                      77
4  Female   31                  17                      40
In [23]:
data.dtypes
Out[23]:
CustomerID int64
Genre object
Age int64
Annual Income (k$) int64
Spending Score (1-100) int64
dtype: object
In [24]:
data.isna().sum()
Out[24]:
CustomerID 0
Genre 0
Age 0
Annual Income (k$) 0
Spending Score (1-100) 0
dtype: int64
In [5]:
sns.pairplot(data)
Out[5]:
<seaborn.axisgrid.PairGrid at 0x20172d70730>
In [6]:
data.columns
Out[6]:
In [7]:
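The code of In [7] and In [10] did not survive extraction. Given the values in Out[10] below and the dtypes shown in In [11], a minimal sketch of the likely preprocessing (the drop and the exact encoding are assumptions):
data = data.drop('CustomerID', axis=1)  # drop the identifier column
data['Genre'] = data['Genre'].map({'Male': 1, 'Female': 2})  # encode Genre as in Out[10]
data.head()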
In [10]:
Out[10]:
   Genre  Age  Annual Income (k$)  Spending Score (1-100)
0      1   19                  15                      39
1      1   21                  15                      81
2      2   20                  16                       6
3      2   23                  16                      77
4      2   31                  17                      40
In [11]:
data.dtypes
Out[11]:
Genre int64
Age int64
Annual Income (k$) int64
Spending Score (1-100) int64
dtype: object
In [26]:
#scaling transformation
#1. z-score normalization using StandardScaler (zero mean, unit variance)
#2. min-max normalization using MinMaxScaler (scales each feature to the 0-1 range)
In [12]:
df_customer = data.iloc[:,2:4]
df_customer.head()
Out[12]:
   Annual Income (k$)  Spending Score (1-100)
0                  15                      39
1                  15                      81
2                  16                       6
3                  16                      77
4                  17                      40
In [13]:
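The body of this cell was lost in extraction. A minimal sketch that would produce the scaled array in Out[13], assuming z-score scaling with StandardScaler (the name data_scaled is used in later cells):
from sklearn.preprocessing import StandardScaler  # assumption: not among the imports above
data_scaled = StandardScaler().fit_transform(df_customer)  # standardize income and spending score
data_scaled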
Out[13]:
array([[-1.73899919, -0.43480148],
[-1.73899919, 1.19570407],
[-1.70082976, -1.71591298],
[-1.70082976, 1.04041783],
[-1.66266033, -0.39597992],
[-1.66266033, 1.00159627],
[-1.62449091, -1.71591298],
[-1.62449091, 1.70038436],
[-1.58632148, -1.83237767],
[-1.58632148, 0.84631002],
[-1.58632148, -1.4053405 ],
[-1.58632148, 1.89449216],
[-1.54815205, -1.36651894],
[-1.54815205, 1.04041783],
[-1.54815205, -1.44416206],
[-1.54815205, 1.11806095],
[-1.50998262, -0.59008772],
[-1.50998262, 0.61338066],
In [14]:
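The body of this cell was lost. A sketch of the elbow-method loop that would produce the nine WCSS values shown in Out[15], assuming k runs from 2 to 10 as in the plot below:
wcss = []
for k in range(2, 11):
    km = KMeans(n_clusters=k, init='k-means++', random_state=0).fit(data_scaled)
    wcss.append(km.inertia_)  # within-cluster sum of squares for this k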
In [15]:
wcss
Out[15]:
[269.01679374906655,
157.70400815035939,
108.92131661364358,
65.56840815571681,
55.103778121150555,
44.86475569922555,
37.24321153347672,
33.85792110528426,
30.684270071530346]
In [16]:
#Plotting
plt.figure(figsize = (8,6), dpi=100)
plt.plot(range(2,11),wcss, marker = 'o', c='blue', markerfacecolor='red')
plt.xlabel('No of Clusters')
plt.ylabel('WCSS')
plt.show()
In [17]:
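The final model fit in this cell was lost. A sketch assuming five clusters, consistent with the elbow plot above and the name Kmodel_final used in In [18]:
Kmodel_final = KMeans(n_clusters=5, init='k-means++', random_state=0).fit(data_scaled)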
In [18]:
cl = Kmodel_final.predict(data_scaled)
In [19]:
cl
Out[19]:
array([0, 3, 0, 3, 0, 3, 0, 3, 0, 3, 0, 3, 0, 3, 0, 3, 0, 3, 0, 3, 0, 3,
0, 3, 0, 3, 0, 3, 0, 3, 0, 3, 0, 3, 0, 3, 0, 3, 0, 3, 0, 3, 0, 1,
0, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 4, 2, 1, 2, 4, 2, 4, 2,
1, 2, 4, 2, 4, 2, 4, 2, 4, 2, 1, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2,
4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2,
4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2, 4, 2,
4, 2])
In [20]:
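The body of this cell was lost. A sketch matching the head shown in Out[20], assuming the predicted labels are attached as a cluster column:
df_customer['cluster'] = cl  # attach the cluster label of each customer
df_customer.head()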
Out[20]:
   Annual Income (k$)  Spending Score (1-100)  cluster
0                  15                      39        0
1                  15                      81        3
2                  16                       6        0
3                  16                      77        3
4                  17                      40        0
In [27]:
# Visualization of clusters
plt.figure(figsize = (6,4), dpi = 100)
plt.scatter(x=df_customer['Annual Income (k$)'], y=df_customer['Spending Score (1-100)'], c=cl)
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score')
plt.show()
Cluster 1 = high income, low spender; Cluster 2 = high income, high spender; Cluster 3 = low income, high spender; Cluster 4 = low income, low spender; Cluster 5 = moderate income, moderate spender.
Conclusion
The mall customer data is clustered into 5 clusters. The green cluster contains people with a high spending score but a low annual income. The purple cluster contains people with a low annual income and a low spending score. The blue cluster contains people with an average annual income and an average spending score. The sea-green cluster contains people with a high annual income and a high spending score. The yellow cluster contains people with a high annual income but a low spending score.
ARCHANA NAIR
#M.Sc. Computer Science (With Specialization in Data Science) #University of Mumbai #PRACTICAL 4 :- Hierarchical Clustering
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.cluster.hierarchy import fcluster
from sklearn.cluster import AgglomerativeClustering
from scipy.spatial.distance import cdist
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
In [3]:
data = pd.read_csv('C:/Users/archa/Downloads/USArrests.csv')
data.head()
Out[3]:
In [4]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 50 non-null object
1 Murder 50 non-null float64
2 Assault 50 non-null int64
3 UrbanPop 50 non-null int64
4 Rape 50 non-null float64
dtypes: float64(2), int64(2), object(1)
memory usage: 2.1+ KB
In [6]:
data.isna().sum()
Out[6]:
Unnamed: 0 0
Murder 0
Assault 0
UrbanPop 0
Rape 0
dtype: int64
In [7]:
data.shape
Out[7]:
(50, 5)
In [8]:
data['Unnamed: 0'].value_counts()
Out[8]:
Georgia 1
Nevada 1
Maryland 1
Hawaii 1
North Dakota 1
Nebraska 1
Rhode Island 1
Missouri 1
Oregon 1
Virginia 1
North Carolina 1
New Hampshire 1
Indiana 1
Idaho 1
New Mexico 1
Florida 1
Oklahoma 1
Arizona 1
Delaware 1
New Jersey 1
Montana 1
Colorado 1
Illinois 1
Vermont 1
Tennessee 1
Arkansas 1
Kansas 1
Ohio 1
Massachusetts 1
South Dakota 1
Louisiana 1
Kentucky 1
Utah 1
Minnesota 1
Alabama 1
West Virginia 1
Washington 1
Pennsylvania 1
Wisconsin 1
Connecticut 1
Texas 1
Wyoming 1
Mississippi 1
California 1
Michigan 1
Iowa 1
South Carolina 1
New York 1
Maine 1
Alaska 1
Name: Unnamed: 0, dtype: int64
In [16]:
data.describe()
Out[16]:
In [17]:
Out[17]:
In [18]:
In [34]:
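The bodies of the cells above were lost in extraction. Since In [19] refers to a 'States' column, one of them presumably renamed the unnamed first column; a minimal sketch:
data = data.rename(columns={'Unnamed: 0': 'States'})  # assumption: give the state column a proper name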
In [19]:
# Converting the States column to numeric using get_dummies from pandas library
data_num = pd.get_dummies(data, columns = ['States'])
data_num.head()
Out[19]:
5 rows × 54 columns
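A cell between this output and In [23] was lost along with its figure. A dendrogram is the usual step before choosing k for agglomerative clustering; a minimal sketch using the scipy functions imported above:
Z = linkage(data_num, method='ward')  # Ward linkage, matching the AgglomerativeClustering call below
plt.figure(figsize=(10, 6))
dendrogram(Z)
plt.show()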
In [23]:
k = 3
h_cluster = AgglomerativeClustering(n_clusters = k, affinity = 'euclidean', linkage = 'ward')
h_cluster.fit(data_num)
Out[23]:
AgglomerativeClustering(n_clusters=3)
In [24]:
cluster = h_cluster.fit_predict(data_num)
In [26]:
# Silhouette Score
print('Silhouette Score: %0.3f' % metrics.silhouette_score(data_num,h_cluster.labels_))
K MEANS CLUSTERING
In [27]:
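The body of this cell was lost. A sketch of the elbow loop that would produce the seven WCSS values in Out[28], assuming k runs from 2 to 8 as in the plot below:
wcss = []
for k in range(2, 9):
    km = KMeans(n_clusters=k, init='k-means++').fit(data_num)
    wcss.append(km.inertia_)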
In [28]:
wcss
Out[28]:
[96447.02814449916,
48011.26535714287,
34774.629357142854,
29124.6065,
19001.82888888889,
16633.143809523808,
13960.15160714286]
In [29]:
# Elbow plot
plt.figure(figsize = (6,4), dpi = 100)
plt.plot(range(2,9), wcss, marker = 'o', c = 'blue')
plt.xlabel('No of clusters')
plt.ylabel('WCSS')
plt.show()
In [30]:
# Final model with 5 clusters
kmod_final = KMeans(n_clusters = 5, init = 'k-means++').fit(data_num)
cl = kmod_final.predict(data_num)
cl
Out[30]:
array([0, 0, 4, 3, 0, 3, 1, 0, 4, 3, 2, 1, 0, 1, 2, 1, 1, 0, 2, 4, 3, 0,
2, 0, 3, 1, 1, 0, 2, 3, 0, 0, 4, 2, 1, 3, 3, 1, 3, 0, 2, 3, 3, 1,
2, 3, 3, 2, 2, 3])
In [31]:
# Silhouette Score
from sklearn.metrics import silhouette_score
score = silhouette_score(data_num, kmod_final.labels_, metric='euclidean')
In [32]:
score
Out[32]:
0.4486735234754001
ARCHANA NAIR
#M.Sc. Computer Science (With Specialization in Data Science) #University of Mumbai #PRACTICAL 5 :- Implement KNN & PCA
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import GridSearchCV
In [3]:
titan_train = pd.read_csv("C:/Users/archa/Downloads/titan_train.csv")
titan_train.head()
Out[3]:
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500
In [5]:
titan_test = pd.read_csv("C:/Users/archa/Downloads/titan_test.csv")
titan_test.head()
Out[5]:
   PassengerId  Pclass                                          Name     Sex   Age  SibSp  Parch   Ticket     Fare Cabin
0          892       3                              Kelly, Mr. James    male  34.5      0      0   330911   7.8292   NaN
1          893       3              Wilkes, Mrs. James (Ellen Needs)  female  47.0      1      0   363272   7.0000   NaN
2          894       2                     Myles, Mr. Thomas Francis    male  62.0      0      0   240276   9.6875   NaN
3          895       3                              Wirz, Mr. Albert    male  27.0      0      0   315154   8.6625   NaN
4          896       3  Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female  22.0      1      1  3101298  12.2875   NaN
In [6]:
titan_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
In [7]:
y_train = titan_train['Survived']
y_train.value_counts()
Out[7]:
0 549
1 342
Name: Survived, dtype: int64
In [8]:
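The body of this cell was lost. 687 of the 891 Cabin values are missing and 687/891 ≈ 0.771, which matches Out[8], so a plausible reconstruction (an assumption) is the share of missing Cabin values, motivating dropping that column:
titan_train['Cabin'].isna().mean()  # fraction of rows with no Cabin value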
Out[8]:
0.7710437710437711
In [9]:
Out[9]:
In [10]:
# Dealing with missing values in the training set and test set
titan_train['Embarked'] = titan_train['Embarked'].fillna('S')
titan_train['Age'] = titan_train['Age'].fillna(titan_train['Age'].mean())
titan_test['Embarked'] = titan_test['Embarked'].fillna('S')
titan_test['Age'] = titan_test['Age'].fillna(titan_test['Age'].mean())
titan_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Survived 891 non-null int64
1 Pclass 891 non-null int64
2 Sex 891 non-null object
3 Age 891 non-null float64
4 SibSp 891 non-null int64
5 Parch 891 non-null int64
6 Fare 891 non-null float64
7 Embarked 891 non-null object
dtypes: float64(2), int64(4), object(2)
memory usage: 55.8+ KB
In [12]:
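The bodies of In [9]–In [12] were lost. Judging from the eight-column info above and the dummy columns in Out[12], they presumably dropped the identifier columns and one-hot encoded Sex and Embarked; a sketch (the column choices are assumptions):
titan_train = titan_train.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)
titan_train = pd.get_dummies(titan_train, columns=['Sex', 'Embarked'])
titan_train.head()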
Out[12]:
   Survived  Pclass   Age  SibSp  Parch     Fare  Sex_female  Sex_male  Embarked_C  Embarked_Q  Embarked_S
0         0       3  22.0      1      0   7.2500           0         1           0           0           1
1         1       1  38.0      1      0  71.2833           1         0           1           0           0
2         1       3  26.0      0      0   7.9250           1         0           0           0           1
3         1       1  35.0      1      0  53.1000           1         0           0           0           1
4         0       3  35.0      0      0   8.0500           0         1           0           0           1
In [13]:
titan_test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Pclass 418 non-null int64
1 Age 418 non-null float64
2 SibSp 418 non-null int64
3 Parch 418 non-null int64
4 Fare 417 non-null float64
5 Sex_female 418 non-null uint8
6 Sex_male 418 non-null uint8
7 Embarked_C 418 non-null uint8
8 Embarked_Q 418 non-null uint8
9 Embarked_S 418 non-null uint8
dtypes: float64(2), int64(3), uint8(5)
memory usage: 18.5 KB
In [14]:
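The body of this cell was lost. Its output matches In [13] except that Fare is now 418 non-null, so it presumably filled the one missing fare; a sketch:
titan_test['Fare'] = titan_test['Fare'].fillna(titan_test['Fare'].mean())  # assumption: mean imputation
titan_test.info()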
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Pclass 418 non-null int64
1 Age 418 non-null float64
2 SibSp 418 non-null int64
3 Parch 418 non-null int64
4 Fare 418 non-null float64
5 Sex_female 418 non-null uint8
6 Sex_male 418 non-null uint8
7 Embarked_C 418 non-null uint8
8 Embarked_Q 418 non-null uint8
9 Embarked_S 418 non-null uint8
dtypes: float64(2), int64(3), uint8(5)
memory usage: 18.5 KB
In [15]:
y_train.shape
Out[15]:
(891,)
In [17]:
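The body of this cell was lost. In [19] applies PCA to a scaled_titan_train matrix, so a sketch of the likely scaling step (dropping the target here is an assumption):
scaled_titan_train = StandardScaler().fit_transform(titan_train.drop('Survived', axis=1))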
In [19]:
# Applying PCA
pca_model = PCA(n_components = 0.95)
pca_scaled_titan = pca_model.fit_transform(scaled_titan_train)
pca_model.explained_variance_ratio_
Out[19]:
In [20]:
In [21]:
pca_model.n_components_
Out[21]:
In [23]:
Out[23]:
In [24]:
In [27]:
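The bodies of In [23]–In [27] were lost. The names used in later cells and the 268-row test set in the confusion matrix (30% of 891) suggest a split of the PCA-transformed data followed by a grid search; a sketch (names and parameters are assumptions guided by those cells):
titan_train, x_test, val_train, val_test = train_test_split(pca_scaled_titan, y_train, test_size=0.3)  # note: reuses the name titan_train for the training features
gscv_model = GridSearchCV(KNeighborsClassifier(), param_grid={'n_neighbors': range(1, 10)})
gscv_model.fit(titan_train, val_train)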
Out[27]:
GridSearchCV(estimator=KNeighborsClassifier(),
param_grid={'n_neighbors': range(1, 10)})
In [28]:
# best value of k
gscv_model.best_params_
Out[28]:
{'n_neighbors': 1}
In [29]:
# Applying KNN
knn_model = KNeighborsClassifier(n_neighbors = 1).fit(titan_train, val_train)
pred = knn_model.predict(x_test)
In [30]:
# confusion matrix
cm = confusion_matrix(val_test, pred)
print(cm)
[[165 1]
[ 0 102]]
In [31]:
# Classification report
print(classification_report(val_test, pred))
In [32]:
Out[32]:
(418, 10)
In [33]:
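The body of this cell was lost. In [34] uses scaled_titan_test and pca_mod, so it presumably scaled the test features and created a fresh PCA model; a sketch:
scaled_titan_test = StandardScaler().fit_transform(titan_test)
pca_mod = PCA(n_components = 0.95)  # keep components explaining 95% of the variance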
In [34]:
pca_test = pca_mod.fit_transform(scaled_titan_test)
In [35]:
pca_mod.explained_variance_ratio_
Out[35]:
In [36]:
plt.figure(figsize = (8,7))
plt.plot(np.cumsum(pca_mod.explained_variance_ratio_))
plt.xlabel('Number of Components')
plt.ylabel('Explained Variance')
plt.show()
In [37]:
pca_mod.n_components_
Out[37]:
8
In [39]:
pred2 = knn_model.predict(pca_test)
pred2
Out[39]:
array([0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0,
1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,
1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1,
1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1,
0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1,
1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0,
0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0,
1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1,
1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,
1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0,
1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0,
1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0,
0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0,
1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,
0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0],
dtype=int64)
ARCHANA NAIR
#M.Sc. Computer Science (With Specialization in Data Science) #University of Mumbai #PRACTICAL 6 :- Implement KNN Regressor
In [2]:
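The imports and data-loading code for this practical were lost in extraction. A sketch consistent with the names used in the cells below (the file path is illustrative):
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error as mse

df_data = pd.read_csv('Admission_Predict.csv')  # hypothetical path; Out[3] shows a 400 x 9 frame
df_data.head()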
Out[2]:
In [3]:
df_data.shape
Out[3]:
(400, 9)
In [8]:
df_data.columns
Out[8]:
In [6]:
df_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Serial No. 400 non-null int64
1 GRE Score 400 non-null int64
2 TOEFL Score 400 non-null int64
3 University Rating 400 non-null int64
4 SOP 400 non-null float64
5 LOR 400 non-null float64
6 CGPA 400 non-null float64
7 Research 400 non-null int64
8 Chance of Admit 400 non-null float64
dtypes: float64(4), int64(5)
memory usage: 28.2 KB
In [10]:
In [12]:
In [16]:
Out[16]:
In [18]:
Out[18]:
In [19]:
In [20]:
Out[20]:
In [21]:
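The bodies of In [19]–In [21] were lost. In [22] predicts with lr_model on x_test, so they presumably split the data and fit a linear regression; a sketch (the feature and target names are assumptions):
x_tr, x_test, y_tr, y_test = train_test_split(x, y, test_size=0.2)  # hypothetical feature matrix x and target y
lr_model = LinearRegression().fit(x_tr, y_tr)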
In [22]:
pred_1 = lr_model.predict(x_test)
pred_1
Out[22]:
In [23]:
Out[23]:
0.7693406066446685
In [24]:
Out[24]:
0.0674866003456461
KNN REGRESSION
In [25]:
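The body of this cell was lost. Out[26] shows the fitted search object, so a sketch of the grid search it implies:
gscv = GridSearchCV(KNeighborsRegressor(), param_grid={'n_neighbors': range(1, 20)}, cv=5)
gscv.fit(x_tr, y_tr)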
In [26]:
Out[26]:
GridSearchCV(cv=5, estimator=KNeighborsRegressor(),
param_grid={'n_neighbors': range(1, 20)})
In [27]:
Out[27]:
{'n_neighbors': 16}
In [28]:
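The body of this cell was lost. Given the best k of 16 from Out[27] and the name pred_knn used below, a sketch:
knn_reg = KNeighborsRegressor(n_neighbors=16).fit(x_tr, y_tr)  # hypothetical model name
pred_knn = knn_reg.predict(x_test)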
In [29]:
# r2 score
r2_score(y_test, pred_knn)
Out[29]:
0.7484011160531958
In [30]:
# Checking rmse
mse(y_test, pred_knn, squared = False)
Out[30]:
0.07048331688551321
ARCHANA NAIR
#M.Sc. Computer Science (With Specialization in Data Science) #University of Mumbai #PRACTICAL 7 :- Time Series (in R)
# Derive year, month and day columns from the Date field
data$Date = as.Date(data$Date)
data['year'] = strftime(data$Date, '%Y')
data['mon'] = strftime(data$Date, '%b')
data['day'] = strftime(data$Date, '%d')
data
# Plot the trend, random and seasonal components returned by decompose()
plot(decomp_data$trend)
plot(decomp_data$random)
plot(decomp_data$seasonal)
# Month-wise boxplot of the series
boxplot(data ~ cycle(data))
# Apply ARIMA (auto.arima from the forecast package)
arima_mod = auto.arima(stationary_2)
f_am = forecast(arima_mod, h = 5, level = c(95))
f_am
plot(f_am)
ARCHANA NAIR
#M.Sc. Computer Science (With Specialization in Data Science) #University of Mumbai #PRACTICAL 8 :- Predicting the authors of the disputed Federalist Papers
In [41]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from sklearn import tree
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
In [42]:
data = pd.read_csv('C:/Users/archa/Downloads/Disputed_Essay_data.csv')
data.head()
Out[42]:
0 dispt dispt_fed_49.txt 0.280 0.052 0.009 0.096 0.358 0.026 0.131 0.122 ... 0.009 0.0
1 dispt dispt_fed_50.txt 0.177 0.063 0.013 0.038 0.393 0.063 0.051 0.139 ... 0.051 0.0
2 dispt dispt_fed_51.txt 0.339 0.090 0.008 0.030 0.301 0.008 0.068 0.203 ... 0.008 0.0
3 dispt dispt_fed_52.txt 0.270 0.024 0.016 0.024 0.262 0.056 0.064 0.111 ... 0.087 0.0
4 dispt dispt_fed_53.txt 0.303 0.054 0.027 0.034 0.404 0.040 0.128 0.148 ... 0.027 0.0
5 rows × 72 columns
In [43]:
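The body of this cell was lost. The heads below show that the training set holds the attributed essays and the test set the disputed ones; a sketch:
data_tr = data[data['author'] != 'dispt']   # essays with a known author
data_test = data[data['author'] == 'dispt'] # the disputed essays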
In [44]:
data_tr.head()
Out[44]:
11 Hamilton Hamilton_fed_1.txt 0.213 0.083 0.000 0.083 0.343 0.056 0.111 0.093 ... 0.0
12 Hamilton Hamilton_fed_11.txt 0.369 0.070 0.006 0.076 0.411 0.023 0.053 0.117 ... 0.0
13 Hamilton Hamilton_fed_12.txt 0.305 0.047 0.007 0.068 0.386 0.047 0.102 0.108 ... 0.0
14 Hamilton Hamilton_fed_13.txt 0.391 0.045 0.015 0.030 0.270 0.045 0.060 0.090 ... 0.0
15 Hamilton Hamilton_fed_15.txt 0.327 0.096 0.000 0.086 0.356 0.014 0.086 0.072 ... 0.0
5 rows × 72 columns
In [45]:
data_test.head()
Out[45]:
0 dispt dispt_fed_49.txt 0.280 0.052 0.009 0.096 0.358 0.026 0.131 0.122 ... 0.009 0.0
1 dispt dispt_fed_50.txt 0.177 0.063 0.013 0.038 0.393 0.063 0.051 0.139 ... 0.051 0.0
2 dispt dispt_fed_51.txt 0.339 0.090 0.008 0.030 0.301 0.008 0.068 0.203 ... 0.008 0.0
3 dispt dispt_fed_52.txt 0.270 0.024 0.016 0.024 0.262 0.056 0.064 0.111 ... 0.087 0.0
4 dispt dispt_fed_53.txt 0.303 0.054 0.027 0.034 0.404 0.040 0.128 0.148 ... 0.027 0.0
5 rows × 72 columns
In [46]:
Out[46]:
author 0
filename 0
a 0
all 0
also 0
..
who 0
will 0
with 0
would 0
your 0
Length: 72, dtype: int64
In [47]:
data_tr['author'].value_counts()
Out[47]:
Hamilton 51
Madison 15
Jay 5
HM 3
Name: author, dtype: int64
In [48]:
Out[48]:
11 0.213 0.083 0.000 0.083 0.343 0.056 0.111 0.093 0.065 0.315 ... 0.000 0.000 0.00
12 0.369 0.070 0.006 0.076 0.411 0.023 0.053 0.117 0.065 0.258 ... 0.000 0.012 0.01
13 0.305 0.047 0.007 0.068 0.386 0.047 0.102 0.108 0.088 0.271 ... 0.000 0.000 0.00
14 0.391 0.045 0.015 0.030 0.270 0.045 0.060 0.090 0.015 0.376 ... 0.000 0.000 0.00
15 0.327 0.096 0.000 0.086 0.356 0.014 0.086 0.072 0.115 0.211 ... 0.014 0.038 0.01
5 rows × 70 columns
In [49]:
In [50]:
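The bodies of In [49] and In [50] were lost. Later cells use x, x_test, y_test and dtree, and the classification report shows a 15-row test set (20% of the 74 attributed essays); a sketch:
x = data_tr.drop(['author', 'filename'], axis=1)  # the 70 function-word columns (cf. Out[48])
y = data_tr['author']
x_tr, x_test, y_tr, y_test = train_test_split(x, y, test_size=0.2)
dtree = DecisionTreeClassifier().fit(x_tr, y_tr)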
In [51]:
pred_dtree = dtree.predict(x_test)
In [52]:
# Classification Report
print(classification_report(y_test, pred_dtree))
accuracy 0.87 15
macro avg 0.55 0.58 0.57 15
weighted avg 0.81 0.87 0.83 15
C:\Users\verma\anaconda3\lib\site-packages\sklearn\metrics\_classification.py:1245: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
In [53]:
# Confusion matrix
cm = confusion_matrix(y_test, pred_dtree)
cm
Out[53]:
array([[ 0, 0, 1],
[ 0, 10, 0],
[ 0, 1, 3]], dtype=int64)
In [54]:
# Accuracy Score
accuracy_score(y_test, pred_dtree)
Out[54]:
0.8666666666666667
In [55]:
tree.export_text(dtree)
Out[55]:
|--- feature_59 <= 0.01
|   |--- feature_64 <= 0.11
|   |   |--- feature_4 <= 0.57
|   |   |   |--- feature_33 <= 0.02
|   |   |   |   |--- class: HM
|   |   |   |--- feature_33 > 0.02
|   |   |   |   |--- class: Hamilton
|   |   |--- feature_4 > 0.57
|   |   |   |--- class: Jay
|   |--- feature_64 > 0.11
|   |   |--- feature_38 <= 0.74
|   |   |   |--- class: HM
|   |   |--- feature_38 > 0.74
|   |   |   |--- class: Madison
|--- feature_59 > 0.01
|   |--- class: Hamilton
In [56]:
# Visualizing the fitted decision tree
plt.figure(figsize = (20,20))
tree.plot_tree(dtree, feature_names = x.columns, class_names = data['author'].value_counts().index, filled = True)
plt.show()
In [57]:
In [58]:
pred_test = dtree.predict(data_test)
pred_test
Out[58]:
In [59]:
Out[59]:
0 0.280 0.052 0.009 0.096 0.358 0.026 0.131 0.122 0.017 0.411 ... 0.017 0.000 0.009
1 0.177 0.063 0.013 0.038 0.393 0.063 0.051 0.139 0.114 0.393 ... 0.000 0.000 0.000
2 0.339 0.090 0.008 0.030 0.301 0.008 0.068 0.203 0.023 0.474 ... 0.015 0.008 0.000
3 0.270 0.024 0.016 0.024 0.262 0.056 0.064 0.111 0.056 0.365 ... 0.079 0.008 0.024
4 0.303 0.054 0.027 0.034 0.404 0.040 0.128 0.148 0.013 0.344 ... 0.020 0.020 0.007
5 rows × 71 columns
In [60]:
In [61]:
In [62]:
Out[62]:
In [63]:
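The body of this cell was lost. Out[64] shows the fitted object, so a sketch of the grid search it implies:
gscv_mod = GridSearchCV(KNeighborsClassifier(), param_grid={'n_neighbors': range(1, 50)}, cv=10)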
In [64]:
gscv_mod.fit(x_tr, y_tr)
C:\Users\verma\anaconda3\lib\site-packages\sklearn\model_selection\_split.py:666: UserWarning: The least populated class in y has only 2 members, which is less than n_splits=10.
warnings.warn(("The least populated class in y has only %d"
Out[64]:
GridSearchCV(cv=10, estimator=KNeighborsClassifier(),
param_grid={'n_neighbors': range(1, 50)})
In [65]:
# Best k
gscv_mod.best_params_
Out[65]:
{'n_neighbors': 10}
In [66]:
# Applying knn
knn_mod = KNeighborsClassifier(n_neighbors = 7).fit(x_tr, y_tr)
In [67]:
# Predicting
pred_knn = knn_mod.predict(x_test)
pred_knn
Out[67]:
In [68]:
Out[68]:
0 0.280 0.052 0.009 0.096 0.358 0.026 0.131 0.122 0.017 0.411 ... 0.009 0.017 0.00
1 0.177 0.063 0.013 0.038 0.393 0.063 0.051 0.139 0.114 0.393 ... 0.051 0.000 0.00
2 0.339 0.090 0.008 0.030 0.301 0.008 0.068 0.203 0.023 0.474 ... 0.008 0.015 0.00
3 0.270 0.024 0.016 0.024 0.262 0.056 0.064 0.111 0.056 0.365 ... 0.087 0.079 0.00
4 0.303 0.054 0.027 0.034 0.404 0.040 0.128 0.148 0.013 0.344 ... 0.027 0.020 0.02
5 0.245 0.059 0.007 0.067 0.282 0.052 0.111 0.252 0.015 0.297 ... 0.007 0.030 0.01
6 0.349 0.036 0.007 0.029 0.335 0.058 0.087 0.073 0.116 0.378 ... 0.015 0.029 0.01
7 0.414 0.083 0.009 0.018 0.478 0.046 0.110 0.074 0.037 0.331 ... 0.018 0.009 0.00
8 0.248 0.040 0.007 0.040 0.356 0.034 0.154 0.161 0.047 0.289 ... 0.027 0.007 0.02
9 0.442 0.062 0.006 0.075 0.423 0.037 0.093 0.100 0.031 0.379 ... 0.000 0.000 0.02
10 0.276 0.048 0.015 0.082 0.324 0.044 0.058 0.135 0.048 0.290 ... 0.044 0.024 0.00
11 rows × 70 columns
In [69]:
In [70]:
data_test2['Pred_Author'] = pred_test
data_test2.head()
Out[70]:
0 0.280 0.052 0.009 0.096 0.358 0.026 0.131 0.122 0.017 0.411 ... 0.017 0.000 0.009
1 0.177 0.063 0.013 0.038 0.393 0.063 0.051 0.139 0.114 0.393 ... 0.000 0.000 0.000
2 0.339 0.090 0.008 0.030 0.301 0.008 0.068 0.203 0.023 0.474 ... 0.015 0.008 0.000
3 0.270 0.024 0.016 0.024 0.262 0.056 0.064 0.111 0.056 0.365 ... 0.079 0.008 0.024
4 0.303 0.054 0.027 0.034 0.404 0.040 0.128 0.148 0.013 0.344 ... 0.020 0.020 0.007
5 rows × 71 columns
In [71]:
# Confusion matrix
print(confusion_matrix(y_test, pred_knn))
[[ 0 1 0]
[ 0 10 0]
[ 0 4 0]]
In [72]:
accuracy_score(y_test, pred_knn)
Out[72]:
0.6666666666666666
The model created using the Decision Tree classifier gives an accuracy score of 87%, whereas the model created using K-Nearest Neighbors gives an accuracy score of 67%.
Therefore, from the accuracy scores we can see that the decision tree performs better in this case.
ARCHANA NAIR
#M.Sc. Computer Science (With Specialization in Data Science) #University of Mumbai #PRACTICAL 9 :- Sentiment Analysis
In [8]:
import nltk
from wordcloud import WordCloud
from nltk.tokenize import word_tokenize
from nltk.tokenize import RegexpTokenizer
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
import pandas as pd
import numpy as np
In [10]:
Out[10]:
   sentiment review
3          1   Good
In [11]:
# Data preprocessing
import string
string.punctuation
Out[11]:
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
In [13]:
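The bodies of In [13]–In [16] were lost. Given the RegexpTokenizer import and the word_token list displayed below, a sketch (the DataFrame and column names are assumptions):
tokenizer = RegexpTokenizer(r'\w+')  # keep word characters only, dropping punctuation
word_token = [tokenizer.tokenize(str(review)) for review in data['review']]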
In [14]:
words
Out[14]:
In [15]:
In [16]:
word_token
Out[16]:
In [17]:
len(word_token)
Out[17]:
14675
In [20]:
nltk.download('stopwords')
Out[20]:
True
In [21]:
Out[21]:
['i',
'me',
'my',
'myself',
'we',
'our',
'ours',
'ourselves',
'you',
"you're",
"you've",
"you'll",
"you'd",
'your',
'yours',
'yourself',
'yourselves',
'he',
In [25]:
# Remove stopwords from each tokenized review
stop_words = set(stopwords.words('english'))  # 'english' is the corpus name; hoisting the set avoids re-reading it per token
word_clean = []
for i in word_token:
    word_clean_row = []
    for j in i:
        if j not in stop_words:
            word_clean_row.append(j)
    word_clean.append(word_clean_row)
In [26]:
word_clean
Out[26]:
In [29]:
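The body of this cell was lost. The name word_lower displayed below suggests lowercasing; a sketch:
word_lower = [[w.lower() for w in row] for row in word_clean]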
In [30]:
word_lower
Out[30]:
In [31]:
# stemming
stemmer = PorterStemmer()  # create the stemmer once instead of per token
word_stem = []
for i in word_lower:
    word_stem_row = []
    for j in i:
        word_stem_row.append(stemmer.stem(j))
    word_stem.append(word_stem_row)
In [32]:
word_stem
Out[32]:
In [33]:
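The body of this cell was lost. The name word_lemma below suggests lemmatization; a sketch (the input list is an assumption):
lemmatizer = WordNetLemmatizer()
word_lemma = [[lemmatizer.lemmatize(w) for w in row] for row in word_stem]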
In [34]:
word_lemma
Out[34]:
In [37]:
In [38]:
word_tag
Out[38]:
In [40]:
In [41]:
filtered_tag
Out[41]:
[['need', 'improvement'],
['mobile',
'hell',
'hour',
'internet',
'lie',
'amazon',
'lenove',
'battery',
'mah',
'booster',
'charger',
'hour',
'don',
'please',
'regret'],
['cash'],
[],
In [44]:
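The body of this cell was lost. Out[45] shows one flat word list, so it presumably flattened the per-review lists; a sketch:
words = [w for row in filtered_tag for w in row]  # flatten per-review lists into one list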
In [45]:
words
Out[45]:
['need',
'improvement',
'mobile',
'hell',
'hour',
'internet',
'lie',
'amazon',
'lenove',
'battery',
'mah',
'booster',
'charger',
'hour',
'don',
'please',
'regret',
'cash',
In [46]:
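The body of this cell was lost. Out[46] shows a single comma-joined string; a sketch:
word_str = ','.join(words)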
Out[46]:
'need,improvement,mobile,hell,hour,internet,lie,amazon,lenove,battery,mah,
booster,charger,hour,don,please,regret,cash,phone,everthey,phone,problem,p
hone,amazon,i,buyi,batterypoor,camerawaste,money,phone,awesome,heat,allot,
reason,hate,lenovo,k,note,battery,level,worn,problem,phone,problem,lenovo,
k,note,service,station,year,warranty,change,phone,lenovo,lot,glitch,dont,t
hing,option,wrost,phone,charger,damage,month,purchase,item,heating,batter
y,life,i,battery,problem,motherboard,problem,month,life,phone,slim,battry,
backup,screen,love,headset,time,product,range,specification,comparison,ran
ge,i,phone,amazon,seal,i,i,credit,card,i,r,deal,amazon,battery,i,solution,
battery,life,smartphone,galery,problem,atmos,speaker,phone,camera,speed,fe
ature,excelent,battery,product,product,camera,battery,phone,product,lenov
o,option,cast,screen,call,option,doesn,hotspot,phone,usb,cable,phone,pric
e,lenovo,display,specification,function,phone,i,fon,i,speekars,i,phone,iss
ue,color,screen,oreo,battery,heating,problem,phone,battery,update,oreo,si
m,customer,service,performance,battery,get,camera,backup,bestin,pricefull,
passa,wasole,phone,phone,performance,signal,restarts,phone,bcoms,plzz,don
t,buy,round,performance,r,day,trust,deal,amazon,disappointment,problem,hea
dache,problem,call,range,phone,rate,camera,quality,product,mobile,price,fe
In [63]:
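The body of this cell was lost. In [64] calls wordcloud.generate(word_str), so it presumably constructed a WordCloud object; a sketch (the parameters are assumptions):
import matplotlib.pyplot as plt  # assumption: needed to render the cloud in In [65]
wordcloud = WordCloud(background_color='white')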
In [64]:
wc = wordcloud.generate(word_str)
In [65]:
Out[65]:
In [66]:
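The bodies of In [66]–In [67] were lost. Out[67] shows one comma-joined string per review; a sketch:
str_list = [','.join(row) for row in filtered_tag]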
In [67]:
str_list
Out[67]:
['need,improvement',
'mobile,hell,hour,internet,lie,amazon,lenove,battery,mah,booster,charger,
hour,don,please,regret',
'cash',
'',
'phone,everthey,phone,problem,phone,amazon',
'i,buyi,batterypoor,camerawaste,money',
'phone,awesome,heat,allot,reason,hate,lenovo,k,note',
'battery,level,worn',
'problem,phone,problem,lenovo,k,note,service,station,year,warranty,chang
e,phone,lenovo',
'lot,glitch,dont,thing,option',
'wrost',
'phone,charger,damage,month',
'purchase,item,heating,battery,life',
'i,battery,problem,motherboard,problem,month,life',
'phone,slim,battry,backup,screen,love',
'headset',
In [68]:
v1 = CountVectorizer().fit(str_list)
In [69]:
v1.get_feature_names()
Out[69]:
['aa',
'aab',
'aachha',
'aaj',
'aajata',
'aal',
'aap',
'aapka',
'aapki',
'aapko',
'aapne',
'aaps',
'aapse',
'aashiyana',
'aata',
'aate',
'aati',
'aavashyakta',
In [70]:
v2 = v1.transform(str_list)
In [71]:
v2.toarray()
Out[71]:
In [72]:
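The body of this cell was lost. In [73] fits on x_tr and y_tr, so it presumably split the vectorized reviews against the sentiment labels; a sketch (the split parameters are assumptions):
from sklearn.model_selection import train_test_split  # assumption: not among the imports above
x_tr, x_test, y_tr, y_test = train_test_split(v2, data['sentiment'], test_size=0.2)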
In [73]:
# Applying NB
nb_mod = MultinomialNB().fit(x_tr,y_tr)
In [74]:
pred = nb_mod.predict(x_test)
pred
Out[74]:
In [75]:
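The body of this cell was lost. Out[75] is a single score, presumably the test accuracy of the Naive Bayes model; a sketch:
from sklearn.metrics import accuracy_score  # assumption
accuracy_score(y_test, pred)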
Out[75]:
0.7176342327609703
ARCHANA NAIR
#M.Sc. Computer Science (With Specialization in Data Science) #University of Mumbai #PRACTICAL 10 :- Implement Regression
In [13]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import pylab
In [14]:
df = pd.read_csv("C:/Users/archa/Downloads/Admission_Predict.csv")
df.head()
Out[14]:
In [15]:
df.dtypes
Out[15]:
In [16]:
df.drop('Serial No.',axis=1,inplace=True)
In [17]:
df.describe()
Out[17]:
(summary statistics for GRE Score, TOEFL Score, University Rating, SOP, LOR, CGPA, Research and Chance of Admit)
In [18]:
plt.figure(figsize=(12,8))
sns.heatmap(df.corr(),annot=True,cmap='tab10')
Out[18]:
<AxesSubplot:>
In [19]:
sns.pairplot(df)
Out[19]:
<seaborn.axisgrid.PairGrid at 0x1412c3a4580>
In [20]:
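The body of this cell was lost. The cells below use X, Y and several unimported sklearn names, so a sketch of the likely setup (the exact target column name in the Admission_Predict file is an assumption):
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.metrics import r2_score
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.decomposition import PCA

X = df.drop('Chance of Admit ', axis=1)  # hypothetical column name (some versions of this file have a trailing space)
Y = df['Chance of Admit ']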
In [21]:
x_tr,x_test,y_tr,y_test = train_test_split(X,Y,test_size=0.2,random_state=100)
In [22]:
lm1=LinearRegression().fit(x_tr,y_tr)
In [23]:
pred = lm1.predict(x_test)
In [24]:
acc=r2_score(y_test,pred)
acc
Out[24]:
0.7792013613144768
In [25]:
Accuracy_Table = pd.DataFrame({'Model_Name':['Multiple_Linear_Regression'],'Accuracy':[acc]})
In [26]:
resid = np.array(pred)-np.array(y_test)
In [27]:
import scipy.stats as s
s.probplot(resid,dist="norm",plot=pylab)
pylab.show()
FEATURE SELECTION
In [30]:
In [31]:
X_scaled=StandardScaler().fit_transform(X)
In [32]:
X_scaled
Out[32]:
In [61]:
fs_model=SelectKBest(f_regression,k=3).fit(X_scaled,Y)
In [62]:
fs_model.get_support(indices=True)
Out[62]:
In [63]:
X.iloc[:,[0,1,5]]
Out[63]:
PCA
In [28]:
pca_mod=PCA(n_components=0.95)
In [33]:
x_pca=pca_mod.fit_transform(X_scaled)
In [34]:
pca_mod.explained_variance_
Out[34]:
In [35]:
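The body of this cell was lost; Out[35] shows a single line object, consistent with a cumulative explained-variance plot:
plt.plot(np.cumsum(pca_mod.explained_variance_ratio_))  # scree-style curve for choosing components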
Out[35]:
[<matplotlib.lines.Line2D at 0x14130040970>]
In [36]:
x_tr,x_test,y_tr,y_test = train_test_split(x_pca,Y,test_size=0.2,random_state=100)
In [37]:
lm3=LinearRegression().fit(x_tr,y_tr)
In [39]:
pred=lm3.predict(x_test)
In [40]:
acc = r2_score(pred, y_test)  # note: r2_score expects (y_true, y_pred); the swapped order here affects the value
acc
Out[40]:
0.7564181483461111
In [41]:
Accuracy_Table=Accuracy_Table.append({'Model_Name':'PCA','Accuracy':acc},ignore_index=True)
In [42]:
resid = np.array(pred)-np.array(y_test)
In [43]:
import scipy.stats as s
s.probplot(resid,dist="norm",plot=pylab)
pylab.show()
In [46]:
x_tr,x_test,y_tr,y_test = train_test_split(X,Y,test_size=0.2,random_state=100)
In [47]:
Lasso_lm=Lasso(alpha=0.00001).fit(x_tr,y_tr)
In [48]:
pred=Lasso_lm.predict(x_test)
In [49]:
acc = r2_score(pred, y_test)  # note: arguments are swapped here as well
acc
Out[49]:
0.7626990495351637
In [50]:
resid = np.array(pred)-np.array(y_test)
In [51]:
import scipy.stats as s
s.probplot(resid,dist="norm",plot=pylab)
pylab.show()
In [52]:
Accuracy_Table = Accuracy_Table.append({'Model_Name':'Lasso_Regression','Accuracy':acc}, ignore_index=True)
In [53]:
Accuracy_Table
Out[53]:
Model_Name Accuracy
0 Multiple_Linear_Regression 0.779201
1 PCA 0.756418
2 Lasso_Regression 0.762699