ML Lab Records
Autonomous Bengaluru
YEAR: 2020-2021
INDEX
PART - A
2. Model building
3. Supervised Learning
b. Decision Tree
d. Random Forest
PART - A
Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Import Dataset
A = pd.read_csv('pima-indians-diabetes_csv.csv.csv')
A
A = pd.read_csv('pima-indians-diabetes_csv.csv.csv',header=None)
A
A.columns = ['Pregnant','Glucose','BP','ST','Insulin','BMI','Diabetes','Age','Class']
A
A.head()
A.tail()
print('Dataset Dimension')
print(A.shape)
print(A.info())
A.iloc[:,3:5]
A['ST']
A.loc[:,'ST']
A.iloc[1,:]
Dataset Statistics
print('Mean Age')
print(A['Age'].mean())#A.iloc[s:e,ci]
print('Median Age')
print(A['Age'].median())
print('Mode of the column Age')
print(A['Age'].mode())
print('Maximum Age')
print(A['Age'].max())
print('Minimum Age')
print(A['Age'].min())
A['Class'].value_counts()
A['Pregnant'].value_counts()
Summary Statistics
A['BMI'].describe()
Dataset Visualization
Histogram
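The cell that defines A_int is not shown in the record; presumably it selects the integer-typed columns of A, for example:
A_int = A.select_dtypes(include='int64')   # assumed: keep only the integer columns of A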
A_int.head()
A_int.hist(figsize=(16,20),bins=50)
A.hist(figsize=(16,20),bins=5)
Bar Chart
# Matplotlib Library
x = A['Class'].unique()
y = A['Class'].value_counts()
plt.bar(x,y)
y.plot(kind='bar')
A['Pregnant'].value_counts().plot(kind='bar',color='purple')
Pie Chart
y.plot(kind='pie')
A['Pregnant'].value_counts().plot(kind='pie', figsize=(10,10))
Scatter Plot
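The scatter-plot cell itself is not shown in the record; a minimal sketch plotting two of the columns (the choice of Age versus BMI is an assumption):
plt.scatter(A['Age'], A['BMI'])   # each point is one patient record
plt.xlabel('Age')
plt.ylabel('BMI')
plt.show()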
import pandas as pd
import numpy as np
df=pd.read_csv('auto-mpg.csv')
df.head()
df.shape
df
df['mpg'].max()
df.info()
df=pd.read_csv('auto-mpg_1.csv')
df.head()
df.info()
df.isnull().sum()
median1 = df['mpg'].median()
median2 = df['cylinders'].median()
median3 =
df['displacement'].median() median4
= df['weight'].median() median5 =
df['acceleration'].median()
df['mpg'].replace(np.nan,median1,inplace=True)
df['cylinders'].replace(np.nan,median2,inplace=True)
df['displacement'].replace(np.nan,median3,inplace=True)
df['weight'].replace(np.nan,median4,inplace=True)
df['acceleration'].replace(np.nan,median5,inplace=True)
df.info()
mode1 = df['horsepower'].mode()[0]   # assumed definition; the cell computing the mode is not shown in the record
df['horsepower'].replace(np.nan,mode1,inplace=True)
df.info()
type(mode1)
duplicate = df.duplicated()
duplicate
duplicate.sum()
df_1 = pd.read_csv('auto-mpg_1.csv')
duplicate = df_1.duplicated()
duplicate.sum()
df_1.shape
duplicate.shape
type(duplicate.values[0])
df_1.drop_duplicates(inplace=True)
duplicate=df_1.duplicated()
duplicate.sum()
df_1.shape
df_1.boxplot(column=['mpg','acceleration'])
df_1.boxplot(column=['displacement'])
df_1.boxplot(column=['weight'])
Q1 = np.percentile(d,25)
Q2 = np.percentile(d,50)
Q3 = np.percentile(d,75)
print('Q1=',Q1)
print('Q2=',Q2)
print('Q3=',Q3)
IQR = Q3 - Q1
print('Interquartile range is', IQR)
low_lim = Q1 - 1.5 * IQR
up_lim = Q3 + 1.5 * IQR
print('low_limit is',low_lim)
print('up_limit is',up_lim)
outlier = []
for x in d[0]:
    if((x > up_lim) or (x < low_lim)):
        outlier.append(x)
print('outlier in the dataset is', outlier)
df.boxplot(column=['mpg'])
Q1 = np.percentile(df['mpg'],25)
Q2 = np.percentile(df['mpg'],50)
Q3 = np.percentile(df['mpg'],75)
print('Q1=',Q1)
print('Q2=',Q2)
print('Q3=',Q3)
IQR = Q3 - Q1
print('Interquartile range is', IQR)
low_lim = Q1 - 1.5 * IQR
up_lim = Q3 + 1.5 * IQR
print('low_limit is',low_lim)
print('up_limit is',up_lim)
outlier = []
for x in df['mpg']:
    if((x > up_lim) or (x < low_lim)):
        outlier.append(x)
print('outlier in the dataset is', outlier)
df.info()
df_trim = df[(df['mpg'] > low_lim) & (df['mpg'] < up_lim)]   # assumed trimming step; the cell is not shown in the record
df_trim.info()
df_cap = df.copy()
df_cap['mpg'] = np.where(df_cap['mpg']>up_lim, up_lim, df_cap['mpg'])
df_cap['mpg'] = np.where(df_cap['mpg']<low_lim, low_lim, df_cap['mpg'])
outlier = []
for x in df['mpg']:
    if((x > up_lim) or (x < low_lim)):
        outlier.append(x)
print('outlier in the dataset is', outlier)
outlier = []
for x in df_cap['mpg']:
    if((x > up_lim) or (x < low_lim)):
        outlier.append(x)
df_cap.info()
Feature Scaling
1. Min-Max Normalization
import numpy as np
X = np.array([1,4,6.7,8,10.9])
norm_X = (X - np.min(X))/(np.max(X)-np.min(X))
print(X)
print(norm_X)
np.max(norm_X)
import pandas as pd
import numpy as np
diabetes = pd.read_csv('pima-indians-diabetes.csv.csv')
diabetes.head()
from sklearn.preprocessing import MinMaxScaler
S = MinMaxScaler()
diabetes_scaled = S.fit_transform(diabetes)
diabetes_scaled
np.max(diabetes_scaled)
2. Z-Score Normalization
X = np.array([1.4,6.7,8,10.9])
X_Znorm = (X-np.mean(X))/np.std(X)
X_Znorm
np.max(X_Znorm)
from sklearn.preprocessing import StandardScaler
Stdscaler = StandardScaler()
diabetes_Zscore = Stdscaler.fit_transform(diabetes)
diabetes_Zscore
np.min(diabetes_Zscore)
np.max(diabetes_Zscore)
Import Libraries
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler,MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
iris = datasets.load_iris()
type(iris)
iris
df = pd.DataFrame(iris.data, columns=iris.feature_names)   # assumed: the cell building the DataFrame from the Bunch is not shown in the record
df.head()
df.shape
df['target'] = iris.target
df.tail()
Dataset Exploration
df.info()
df.describe()
df.boxplot()
df.groupby('target').size()
X = df.drop(['target'], axis=1)
print(X)
Y = df['target']
Y
xtrain,xtest,ytrain,ytest = train_test_split(X,Y,test_size=0.3,random_state=4,stratify=Y)
[xtrain.shape,xtest.shape,ytrain.shape,ytest.shape]
ytrain.value_counts()
ytest.value_counts()
105/3
Z-Score Standardization
xtrain
ytrain
S = StandardScaler()
xtrain = S.fit_transform(xtrain)
xtest = S.transform(xtest)   # use transform, not fit_transform, so the test set reuses the training-set scaling parameters
xtest
np.min(xtest)
np.max(xtest)
Model Creation
model = LogisticRegression(solver='liblinear',random_state=0)
model
model.fit(xtrain,ytrain)
ytest[1]
y_pred = model.predict(xtest)
ytest[1]
y_pred[1]
Confusion Matrix
cm = metrics.confusion_matrix(ytest,y_pred)
print(cm)
Classification Accuracy
Acc = (cm[0,0]+cm[1,1]+cm[2,2])/np.sum(cm)
print(Acc)
Accuracy = metrics.accuracy_score(ytest,y_pred)
print(Accuracy)
ytrain.value_counts()
ytest.value_counts()
#S = StandardScaler()
S = MinMaxScaler()
xtrain = S.fit_transform(xtrain)
xtest = S.transform(xtest)
model.fit(xtrain,ytrain)
y_pred = model.predict(xtest)
cm = metrics.confusion_matrix(ytest,y_pred)
print(cm)
Accuracy = metrics.accuracy_score(ytest,y_pred)
print(Accuracy)
Q1 = np.percentile(df['sepal width (cm)'],25)   # assumed: quartiles of the sepal-width column trimmed below; the cell is not shown in the record
Q3 = np.percentile(df['sepal width (cm)'],75)
IQR = Q3-Q1
low_limit = Q1-1.5*IQR
up_limit = Q3+1.5*IQR
Outliers Trimming
df_trim = df[(df['sepal width (cm)'] > low_limit) & (df['sepal width (cm)'] < up_limit)]
df_trim.shape
X = df_trim.drop(['target'],axis=1)
Y = df_trim['target']
xtrain,xtest,ytrain,ytest = train_test_split(X,Y,test_size=0.3,random_state=4,stratify=Y)
[xtrain.shape,xtest.shape,ytrain.shape,ytest.shape]
ytrain.value_counts()
ytest.value_counts()
S = StandardScaler()
xtrain = S.fit_transform(xtrain)
xtest = S.transform(xtest)
model = LogisticRegression(solver='liblinear',random_state=0)
model.fit(xtrain,ytrain)
y_pred = model.predict(xtest)
cm = metrics.confusion_matrix(ytest,y_pred)
print(cm)
Accuracy = metrics.accuracy_score(ytest,y_pred)
print(Accuracy)
Outlier Capping
df_cap = df.copy()
df_cap['sepal width (cm)'] = np.where(df_cap['sepal width (cm)']>up_limit, up_limit, df_cap['sepal width (cm)'])   # assumed capping step; not shown in the record
df_cap['sepal width (cm)'] = np.where(df_cap['sepal width (cm)']<low_limit, low_limit, df_cap['sepal width (cm)'])
df_cap.shape
X = df_cap.drop(['target'],axis=1)
Y = df_cap['target']
xtrain,xtest,ytrain,ytest = train_test_split(X,Y,test_size=0.3,random_state=4,stratify=Y)
[xtrain.shape,xtest.shape,ytrain.shape,ytest.shape]
ytrain.value_counts()
ytest.value_counts()
S = StandardScaler()
xtrain = S.fit_transform(xtrain)
xtest = S.transform(xtest)
model = LogisticRegression(solver='liblinear',random_state=0)
model.fit(xtrain,ytrain)
y_pred = model.predict(xtest)
cm = metrics.confusion_matrix(ytest,y_pred)
print(cm)
Accuracy = metrics.accuracy_score(ytest,y_pred)
print(Accuracy)
Import Libraries
import numpy as np
import pandas as pd
from sklearn import svm, metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
Load Dataset
data = pd.read_csv('diabetes (1).csv')
data.head()
X = data.drop('Outcome',axis=1)   # assumed: features are all columns except Outcome; the defining cell is not shown in the record
Y = data['Outcome']
[X.shape, Y.shape]
Y.value_counts()
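The split, scaling, and model-fitting cells are not shown in the record; a minimal sketch, with the split ratio, kernel, and random state assumed (an SVC is used because decision_function is called for the ROC curve below):
xtrain,xtest,ytrain,ytest = train_test_split(X,Y,test_size=0.3,random_state=1,stratify=Y)
S = StandardScaler()
xtrain = S.fit_transform(xtrain)
xtest = S.transform(xtest)
model = svm.SVC(kernel='linear')   # assumed kernel
model.fit(xtrain,ytrain)
y_pred = model.predict(xtest)
cm = metrics.confusion_matrix(ytest,y_pred)
print(cm)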
ytest.value_counts()
Calculation of accuracy
accuracy = metrics.accuracy_score(ytest,y_pred)
print(accuracy)
For sensitivity calculation, consider the row corresponding to disease positive, i.e., Row 2 in this case.
sensitivity = cm[1,1]/(cm[1,0]+cm[1,1])
print(sensitivity)
For specificity calculation, consider the row corresponding to disease negative, i.e., Row 1 in this case.
specificity = cm[0,0]/(cm[0,0]+cm[0,1])
print(specificity)
y_score = model.decision_function(xtest)
ytest[0]
y_pred[0]
y_score[0]
np.max(y_score)
np.min(y_score)
fpr, tpr, _ = metrics.roc_curve(ytest, y_score)
roc_auc = metrics.auc(fpr, tpr)
roc_auc
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics, datasets
iris = datasets.load_iris()
iris
iris.feature_names
X = iris.data
Y = iris.target
X.shape
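The split and model-fitting cells are not shown; a minimal sketch, assuming a 70/30 split and the entropy criterion (suggested by the 'iris_entropy.png' output below):
x_train,x_test,y_train,y_test = train_test_split(X,Y,test_size=0.3,random_state=1)
clf = DecisionTreeClassifier(criterion='entropy')
clf.fit(x_train,y_train)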
y_pred = clf.predict(x_test)
import pydotplus
from io import StringIO                    # buffer that receives the DOT text
from IPython.display import Image
from sklearn.tree import export_graphviz
dot_data = StringIO()
export_graphviz(clf,out_file=dot_data,
filled=True, rounded=True,
special_characters=True, feature_names=iris.feature_names,
class_names=iris.target_names)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('iris_entropy.png')
Image(graph.create_png())
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
X,Y=make_classification(n_samples=200,n_features=8,n_informative=8,n_redundant=0,n_repeated=0,n_classes=2,random_state=14)
X[0]
Standard Scaling
SC = StandardScaler()
X = SC.fit_transform(X)   # assign the scaled array back so the split below uses scaled features
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.2,random_state=10)
print(X_train.shape,X_test.shape,Y_train.shape,Y_test.shape)
pd.DataFrame(data = Y_train).value_counts()
knn = KNeighborsClassifier(n_neighbors=5)   # assumed k; the model-creation cell is not shown in the record
knn.fit(X_train,Y_train)
y_pred = knn.predict(X_test)
acc = metrics.accuracy_score(Y_test,y_pred)
print(acc)
error_train=[]
error_test=[]
x=[1,3,5,7,9,11,13,15]
for k in x:
    knn=KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train,Y_train)
    y_pred1=knn.predict(X_train)
    y_pred2=knn.predict(X_test)
    acc=metrics.accuracy_score(Y_test,y_pred2)
    error_train.append(np.mean(Y_train!=y_pred1))
    error_test.append(np.mean(Y_test!=y_pred2))
plt.plot(x,error_train,label='Train')
plt.plot(x,error_test,label='Test')
plt.xlabel('K Value')
plt.ylabel('Error')
plt.legend()
plt.title("Error Curve")
from sklearn.decomposition import PCA
pca = PCA(n_components=4)   # assumed: four components, matching the PC1-PC4 columns below
principalcomponents=pca.fit_transform(X)
pc=pd.DataFrame(data=principalcomponents,columns=['PC1','PC2','PC3','PC4'])
X_train,X_test,Y_train,Y_test=train_test_split(principalcomponents,Y,test_size=0.2,random_state=10)
knn=KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train,Y_train)
y_pred=knn.predict(X_test)
acc=metrics.accuracy_score(Y_test,y_pred)
print(acc)
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier   # assumed classifier for this program; the Random Forest entry in the index and the feature-importance step below suggest it
from sklearn.model_selection import train_test_split
from sklearn import metrics
cancer = load_breast_cancer()
cancer
cancer.data
cancer.target
cancer.feature_names
data = pd.DataFrame(cancer.data, columns=cancer.feature_names)   # assumed: the cell building the DataFrame is not shown in the record
data['target']=cancer.target
data.head()
X=cancer.data # method 2
Y=data['target'] # method 1
Y=cancer.target # method 2
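The split and model-fitting cells are not shown; a minimal sketch, assuming a Random Forest classifier with a 70/30 stratified split:
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.3,random_state=3,stratify=Y)
clf = RandomForestClassifier(n_estimators=100,random_state=3)   # assumed parameters
clf.fit(X_train,Y_train)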
Y_pred=clf.predict(X_test)
Classifier performance
print(metrics.confusion_matrix(Y_test,Y_pred))
print(metrics.accuracy_score(Y_test,Y_pred))
Feature Importance
fea_imp=pd.Series(clf.feature_importances_,index=cancer.feature_names).sort_values(ascending=False)
fea_imp
Y1=data['target']
# X1 is assumed to be a reduced feature matrix built from the top-ranked features above; the cell that defines it is not shown in the record
X_train1,X_test1,Y_train1,Y_test1=train_test_split(X1,Y1,test_size=0.3,random_state=3,stratify=Y)
import pandas as pd
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn import metrics
Create a dataset
X,Y=make_regression(n_samples=100,n_features=20,n_informative=15,noise=0.1,random_state=2)
print(Y[:10])
Create a Model
model=RandomForestRegressor(n_estimators=10)
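The split and fitting cells are not shown; a minimal sketch with an assumed 70/30 split:
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.3,random_state=2)
model.fit(X_train,Y_train)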
y_pred=model.predict(X_test)
Y_test
mse = metrics.mean_squared_error(y_pred,Y_test)
print(mse)
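The setup cells for the K-Means clustering program (presumably Program 4 a) are not shown in the record; a minimal sketch, assuming the age/income dataset is stored as income.csv (hypothetical filename) and an initial three-cluster run that produces df1, df2, and df3:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import silhouette_score
df = pd.read_csv('income.csv')                            # hypothetical filename
km = KMeans(n_clusters=3)                                 # assumed: first clustering on the unscaled features
df['cluster'] = km.fit_predict(df[['Age','Income($)']])
df1 = df[df.cluster==0]
df2 = df[df.cluster==1]
df3 = df[df.cluster==2]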
print(df1.head())
print(df2.head())
print(df3.head())
plt.scatter(df1.Age,df1['Income($)'], color='red')
plt.scatter(df2.Age,df2['Income($)'], color='green')
plt.scatter(df3.Age,df3['Income($)'], color='blue')
The data points are not clustered properly because the x- and y-axis ranges differ drastically. This improper clustering can be overcome by feature scaling.
scaler = MinMaxScaler()
df['Age']=scaler.fit_transform(df[['Age']])
df['Income($)']=scaler.fit_transform(df[['Income($)']])
df.head()
df.drop(['cluster'],axis=1)
km = KMeans(n_clusters=4)
y_pred = km.fit_predict(df[['Age','Income($)']])
df['cluster'] = y_pred
df1 = df[df.cluster==0]
df2 = df[df.cluster==1]
df3 = df[df.cluster==2]
df4 = df[df.cluster==3]
km.cluster_centers_
plt.scatter(df1.Age,df1['Income($)'], color='red')
plt.scatter(df2.Age,df2['Income($)'], color='green')
plt.scatter(df3.Age,df3['Income($)'], color='blue')
plt.scatter(df4.Age,df4['Income($)'], color='pink')
plt.scatter(km.cluster_centers_[:,0],km.cluster_centers_[:,1], marker='*', s=200, color='purple', label='centroid')
plt.legend()
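The loop that computes the sum of squared errors for the elbow plot is not shown in the record; a minimal sketch, with the range of K values assumed:
sse = []
k_rng = range(1,10)                         # assumed range of K values
for k in k_rng:
    km = KMeans(n_clusters=k)
    km.fit(df[['Age','Income($)']])
    sse.append(km.inertia_)                 # sum of squared distances to the closest centroid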
sse
plt.xlabel('K')
plt.ylabel('SSE')
plt.plot(k_rng,sse)
plt.grid()
s_score=[]
k_rng=range(2,12)
for k in k_rng:
    km = KMeans(n_clusters=k)
    km.fit_predict(df[['Age','Income($)']])
    S = silhouette_score(df[['Age','Income($)']],km.labels_)
    print("k:{} and Silhouette Score:{}".format(k,S))
    s_score.append(S)
plt.xlabel('K')
plt.ylabel('Silhouette Score')
plt.plot(k_rng,s_score)
plt.grid()
22/10/2021
Program 4 b. Unsupervised Learning – Hierarchical Clustering
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
import scipy.cluster.hierarchy as sch
from sklearn.datasets import make_blobs
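The cells that generate the blob data and compute the cluster labels are not shown; a minimal sketch, with the sample count, number of centers, and random state assumed:
X, y = make_blobs(n_samples=200, centers=4, random_state=1)        # assumed parameters
agg = AgglomerativeClustering(n_clusters=4, linkage='ward')        # labels used for colouring the scatter plots below
labels = agg.fit_predict(X)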
X[0:5,:]
plt.scatter(X[:,0],X[:,1])
plt.scatter(X[:,0],X[:,1],c=labels,cmap='rainbow')
df.head()
df.info()
X_arr = np.array(X)
plt.figure(figsize=(10,10))
dend = sch.dendrogram(sch.linkage(X,method='ward'))
plt.scatter(X_arr[:,0],X_arr[:,1],c=labels,cmap='rainbow')
from sklearn.preprocessing import MinMaxScaler
S = MinMaxScaler()
X_sca = S.fit_transform(X_arr)
dend = sch.dendrogram(sch.linkage(X_sca,method='ward'))
plt.scatter(X_sca[:,1],X_sca[:,2],c=labels,cmap='rainbow')
Standard Scaling
from sklearn.preprocessing import StandardScaler
S = StandardScaler()
X_sca = S.fit_transform(X_arr)
dend = sch.dendrogram(sch.linkage(X_sca,method='ward'))
plt.scatter(X_sca[:,1],X_sca[:,2],c=labels,cmap='rainbow')
04/11/2021
Program 5 Neural Network Model
import sklearn.datasets
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.neural_network import MLPClassifier
from sklearn import metrics
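The cell that loads the dataset is not shown in the record; presumably a diabetes CSV with an Outcome column, for example:
Data = pd.read_csv('diabetes.csv')   # hypothetical filename; the load cell is missing from the record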
Data.head()
Data.tail(10)
X = Data.drop('Outcome',axis=1)
Y = Data['Outcome']
Y.head()
S = MinMaxScaler()
X = S.fit_transform(X)
X[0,:]
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.3,stratify=Y,random_state=8)
X_train.shape
model = MLPClassifier(hidden_layer_sizes=(15,50,20,23,20,50),activation='tanh',solver='adam',batch_size='auto',max_iter=1000)
model.fit(X_train,Y_train)
y_pred = model.predict(X_test)
cm = metrics.confusion_matrix(Y_test,y_pred)
print(cm)
print(metrics.accuracy_score(Y_test,y_pred))
PART - B
INTRODUCTION
Housing is one of the most essential human needs, along with other fundamental needs such as food and water. Demand for houses has grown rapidly over the years as people's living standards have improved. While some people buy a house as an investment or a property, most people around the world buy a house as their shelter and their livelihood.
House prices can be predicted with many machine learning models, such as support vector regression, linear regression, and more. Home buyers, property investors, and house builders can all benefit from a house-price model: it provides information and knowledge such as the valuation of house prices in the present market, which helps them determine a reasonable price for a house.
The three performance metrics, mean squared error (MSE), root mean squared error (RMSE), and mean absolute percentage error (MAPE), also show that the algorithms listed below clearly outperform SVM.
Linear regression is the simplest algorithm in machine learning, and it can be trained in different ways. In this notebook we cover the following linear algorithms:
1. Linear Regression
2. Robust Regression
3. Ridge Regression
4. LASSO Regression
5. Elastic Net
6. Polynomial Regression
7. Stochastic Gradient Descent
8. Artificial Neural Network
'Avg. Area Income': average income of residents of the city the house is located in
'Avg. Area House Age': average age of the houses in the same city
'Avg. Area Number of Rooms': average number of rooms for houses in the same city
'Avg. Area Number of Bedrooms': average number of bedrooms for houses in the same city
'Area Population': population of the city the house is located in
'Price': price that the house sold at
'Address': address of the house
Code Snippet:
Exploratory Data Analysis (EDA)
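The code cells themselves are not reproduced in this record; a minimal sketch of the load and EDA steps, assuming the dataset file is named USA_Housing.csv (hypothetical filename) and following the 70/30 split and standard scaling described in the modeling steps below:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv('USA_Housing.csv')              # hypothetical filename
print(df.head())
print(df.describe())
df.hist(figsize=(12,10), bins=30)                # distribution of each numeric column
print(df.drop('Address', axis=1).corr())         # correlation of the numeric columns with Price

X = df.drop(['Price','Address'], axis=1)         # Address is free text, not used as a feature
Y = df['Price']
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.3,random_state=1)
S = StandardScaler()
X_train = S.fit_transform(X_train)
X_test = S.transform(X_test)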
Linear Regression
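A minimal sketch of the linear regression fit, using the split and scaled features from the sketch above:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X_train, Y_train)
y_pred = lr.predict(X_test)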
Robust Regression
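The record does not show which robust estimator was used; a sketch with scikit-learn's HuberRegressor, which down-weights outliers, as one possible choice:
from sklearn.linear_model import HuberRegressor
huber = HuberRegressor()          # assumed choice of robust estimator
huber.fit(X_train, Y_train)
y_pred = huber.predict(X_test)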
Ridge Regression
Ridge regression addresses some of the problems of Ordinary Least Squares by imposing a
penalty on the size of coefficients.
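A minimal sketch of the ridge fit, using the same split; the alpha value is an assumption:
from sklearn.linear_model import Ridge
ridge = Ridge(alpha=1.0)          # assumed regularization strength
ridge.fit(X_train, Y_train)
y_pred = ridge.predict(X_test)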
LASSO Regression
A linear model that estimates sparse coefficients. Mathematically, it consists of a linear model trained with an ℓ1 prior as regularizer. The objective function to minimize is:
min_w  (1 / (2 * n_samples)) * ‖Xw − y‖₂² + α * ‖w‖₁
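A minimal sketch of the LASSO fit, with the alpha value assumed:
from sklearn.linear_model import Lasso
lasso = Lasso(alpha=0.1)          # assumed regularization strength
lasso.fit(X_train, Y_train)
y_pred = lasso.predict(X_test)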
Elastic Net
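Elastic Net combines the ℓ1 and ℓ2 penalties; a minimal sketch, with alpha and l1_ratio assumed:
from sklearn.linear_model import ElasticNet
enet = ElasticNet(alpha=0.1, l1_ratio=0.5)   # assumed mix of L1 and L2 penalties
enet.fit(X_train, Y_train)
y_pred = enet.predict(X_test)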
Polynomial Regression
One common pattern within machine learning is to use linear models trained on nonlinear functions
of the data. This approach maintains the generally fast performance of linear methods, while
allowing them to fit a much wider range of data.
For example, a simple linear regression can be extended by constructing polynomial features from
the coefficients.
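A minimal sketch using PolynomialFeatures in a pipeline with a linear model; the degree is an assumption:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())   # assumed degree
poly_model.fit(X_train, Y_train)
y_pred = poly_model.predict(X_test)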
Models Comparison
Modeling steps
1. Exploratory Data Analysis: analyze and investigate data sets and summarize their
main characteristics, often employing data visualization methods.
2. Train test split: we separate 70% of our dataset for training the model and 30% of the dataset
for testing the model
3. Scale the data: using a standard scaler, we transform the features so that each variable has zero mean and unit variance
4. Create a predictions data frame: generate a data frame that includes the actual price of the
house captured in our test set and the predicted results from our model so that we can quantify
our success
5. Score the models: compute the root mean squared error (RMSE), mean absolute error (MAE), mean squared error (MSE), R² score, and cross-validation score of our predictions to compare the performance of our models
Comparing Models
To compare model performance, we will look at root mean squared error (RMSE) and mean absolute
error (MAE). These measurements are both commonly used for comparing model performance, but
they have slightly different intuition and mathematical meaning.
MAE: the mean absolute error tells us on average how far our predictions are from the true
value. In this case, all errors receive the same weight.
RMSE: we calculate RMSE by taking the square root of the mean of the squared errors. Because the errors are squared, larger errors have a greater impact on the overall error than smaller errors.
We use get_scores to calculate the RMSE and MAE scores for each model
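The body of get_scores is not shown in the record; a sketch of such a helper, assuming it returns the metrics listed in the modeling steps:
from sklearn import metrics
import numpy as np

def get_scores(model, X_test, Y_test):
    # Assumed helper: predict on the test set and report the error metrics for one fitted model
    y_pred = model.predict(X_test)
    mae = metrics.mean_absolute_error(Y_test, y_pred)
    mse = metrics.mean_squared_error(Y_test, y_pred)
    rmse = np.sqrt(mse)
    r2 = metrics.r2_score(Y_test, y_pred)
    return {'MAE': mae, 'MSE': mse, 'RMSE': rmse, 'R2': r2}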
CONCLUSION
To predict house prices, simple to advanced machine learning algorithms have been implemented, such as Linear Regression, Ridge Regression, SVM Regressor, Polynomial Regressor, Elastic Net Regressor, Lasso Regressor, Robust Regressor, and Random Forest. It has been observed that all the regressors except the Random Forest regressor yield an efficiency of 90% and above.
The outcomes of these machine learning algorithms help in selecting the most suitable prediction algorithm.