Machine Learning Record VR19
Description :
Pre-processing refers to the transformations applied to our data before feeding it to the algorithm. Data preprocessing
is a technique used to convert raw data into a clean data set. In other words, whenever data is gathered
from different sources it is collected in a raw format that is not feasible for analysis.
Need of Data Preprocessing
• For achieving better results from the applied model in Machine Learning projects, the data has to be in a proper
format. Some Machine Learning models need information in a specified format; for example, the Random Forest
algorithm does not support null values, so null values have to be handled in the original raw data set before the
algorithm can be run.
• Another aspect is that the data set should be formatted in such a way that more than one Machine Learning or
Deep Learning algorithm can be executed on the same data set, and the best of them chosen.
Dataset Description :
The Carseats data set tracks sales information for car seats. It has 400 observations (each at a different store) and 11
variables; the individual variables are listed in the dataset description of Experiment 2.
Code :
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt
dt= pd.read_csv('Carseats.csv')
x=dt.iloc[:, 1:11]
y=dt.iloc[:, 0]
iloc is a Pandas indexer that selects rows and columns from the data set by position. Here x takes all rows and
columns 1 to 10 (i.e. iloc[:, 1:11], every column except the first), and y takes all rows of the first column
(index 0), i.e. Sales.
print(x)
    Education Urban   US
0          17   Yes  Yes
1          10   Yes  Yes
2          12   Yes  Yes
3          14   Yes  Yes
4          13   Yes   No
..        ...   ...  ...
395        14   Yes  Yes
396        11    No  Yes
397        18   Yes  Yes
398        12   Yes  Yes
399        16   Yes  Yes
[400 rows x 10 columns]
print(x.head())
   Education Urban   US
0         17   Yes  Yes
1         10   Yes  Yes
2         12   Yes  Yes
3         14   Yes  Yes
4         13   Yes   No
The head() function displays the first five rows of the dataframe by default.
print(y)
0 9.50
1 11.22
2 10.06
3 7.40
4 4.15
...
395 12.57
396 6.14
397 7.41
398 5.94
399 9.71
Name: Sales, Length: 400, dtype: float64
print(y.head())
0 9.50
1 11.22
2 10.06
3 7.40
4 4.15
Name: Sales, dtype: float64
le=LabelEncoder()
x.iloc[:,8]=le.fit_transform(x.iloc[:,8])
x.iloc[:,9]=le.fit_transform(x.iloc[:,9])
x.head()
CompPrice Income Advertising Population Price ShelveLoc Age Education Urban
0 138 73 11 276 120 Bad 42 17 1
1 111 48 16 260 83 Good 65 10 1
2 113 35 10 269 80 Medium 59 12 1
3 117 100 4 466 97 Medium 55 14 1
4 141 64 3 340 128 Bad 38 13 1
To convert labels that have categorical values into numerical values, we use LabelEncoder.
ohe=OneHotEncoder()
x1=pd.DataFrame(ohe.fit_transform(x[['ShelveLoc']]).toarray())
x1.head()
0 1 2
One-hot encoding converts categorical data into numeric data by splitting the column into multiple columns, one per
category. Each new column contains 1 where the row belongs to that category and 0 otherwise.
x=x.join(x1)
x.head()
We can drop the redundant columns or the columns which are not of use.
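That clean-up step is not shown in the record; a minimal sketch (column names assumed from the one-hot output above) could be:
# drop the original categorical column now represented by the one-hot columns,
# and give the new columns readable names
x = x.drop(['ShelveLoc'], axis=1)
x = x.rename(columns={0: 'ShelveLoc_Bad', 1: 'ShelveLoc_Good', 2: 'ShelveLoc_Medium'})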
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=0)
Dividing the dataset into a training set and a test set using train_test_split(); here 20% of the
data is taken for testing and the remaining 80% for training.
sc=StandardScaler()
x_train.iloc[:,0:7]=sc.fit_transform(x_train.iloc[:,0:7])
x_test.iloc[:,0:7]=sc.transform(x_test.iloc[:,0:7])  # transform only, so the test set is scaled with the training-set statistics
x_train.head()
EXPERIMENT -02
Aim : To convert continuous values to categorical values for the given dataset.
Description :
Numerical data such as continuous, highly skewed data is frequently seen in data analysis. Sometimes analysis
becomes effortless on conversion from continuous to discrete data. There are many ways in which conversion can be
done, one such way is by using Pandas’ integrated cut-function. Pandas’ cut function is a distinguished way of
converting numerical continuous data into categorical data. It has 3 major necessary parts:
1. First and foremost is the 1-D array/Dataframe required for input.
2. The other main part is bins. Bins represent the boundaries of the separate bins for the continuous data: the first number
denotes the start point of a bin and the following number denotes its endpoint. The cut function permits explicit control
over these bin boundaries.
3. The final main part is labels. The number of labels without exception will be one lower than the number of bins.
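For instance, a toy illustration of those three parts (values chosen arbitrarily):
import pandas as pd
# six values cut into three bins with boundaries 0, 3, 6, 9 and one label per bin
pd.cut([1, 7, 5, 4, 6, 3], bins=[0, 3, 6, 9], labels=['low', 'mid', 'high'])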
Dataset Description :
The Carseats data set tracks sales information for car seats. It has 400 observations (each at a different store) & 11 variables:
• CompPrice: price charged by competitor at each location
• Sales : unit sales in thousands
• Income: community income level in 1000s of dollars
• Advertising: local ad budget at each location in 1000s of dollars
• Population: regional pop in thousands
• Price: price for car seats at each site
• ShelveLoc: Bad, Good or Medium indicates quality of shelving location
• Age: age level of the population
• Education: ed level at location
• Urban: Yes/No
• US: Yes/No
Code :
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt
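The cell that loads the data and builds X and y is not reproduced in the record; presumably it mirrors Experiment 1 (file name and column positions assumed from the surrounding output):
dt = pd.read_csv('Carseats.csv')
X = dt.iloc[:, 1:11].values   # all predictor columns
y = dt.iloc[:, 0].values      # the Sales column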
iloc is a Pandas indexer that selects rows and columns from the data set by position. Here X takes all rows and
columns 1 to 10, and y takes all rows of the first column (index 0), i.e. Sales.
X
array([[138, 73, 11, ..., 17, 'Yes', 'Yes'],
[111, 48, 16, ..., 10, 'Yes', 'Yes'],
[113, 35, 10, ..., 12, 'Yes', 'Yes'],
...,
[162, 26, 12, ..., 18, 'Yes', 'Yes'],
[100, 79, 7, ..., 12, 'Yes', 'Yes'],
[134, 37, 0, ..., 16, 'Yes', 'Yes']], dtype=object)
Values of x dataset
y
array([ 9.5 , 11.22, 10.06, 7.4 , 4.15, 10.81, 6.63, 11.85, 6.54,
4.69, 9.01, 11.96, 3.98, 10.96, 11.17, 8.71, 7.58, 12.29,
13.91, 8.73, 6.41, 12.13, 5.08, 5.87, 10.14, 14.9 , 8.33,
5.27, 2.99, 7.81, 13.55, 8.25, 6.2 , 8.77, 2.67, 11.07,
8.89, 4.95, 6.59, 3.24, 2.07, 7.96, 10.43, 4.12, 4.16,
4.56, 12.44, 4.38, 3.91, 10.61, 1.42, 4.42, 7.91, 6.92,
4.9 , 6.85, 11.91, 0.91, 5.42, 5.21, 8.32, 7.32, 1.82,
8.47, 7.8 , 4.9 , 8.85, 9.01, 13.39, 7.99, 9.46, 6.5 ,
5.52, 12.61, 6.2 , 8.55, 10.64, 7.7 , 4.43, 9.14, 8.01,
7.52, 11.62, 4.42, 2.23, 8.47, 8.7 , 11.7 , 6.56, 7.95,
5.33, 4.81, 4.53, 8.86, 8.39, 5.58, 9.48, 7.45, 12.49,
4.88, 4.11, 6.2 , 5.3 , 5.07, 4.62, 5.55, 0.16, 8.55,
3.47, 8.98, 9. , 6.62, 6.67, 6.01, 9.31, 8.54, 5.08,
8.8 , 7.57, 7.37, 6.87, 11.67, 6.88, 8.19, 8.87, 9.34,
11.27, 6.52, 4.96, 4.47, 8.41, 6.5 , 9.54, 7.62, 3.67,
6.44, 5.17, 6.52, 10.27, 12.3 , 6.03, 6.53, 7.44, 0.53,
9.09, 8.77, 3.9 , 10.51, 7.56, 11.48, 10.49, 10.77, 7.64,
5.93, 6.89, 7.71, 7.49, 10.21, 12.53, 9.32, 4.67, 2.93,
3.63, 5.68, 8.22, 0.37, 6.71, 6.71, 7.3 , 11.48, 8.01,
12.49, 9.03, 6.38, 0. , 7.54, 5.61, 10.48, 10.66, 7.78,
4.94, 7.43, 4.74, 5.32, 9.95, 10.07, 8.68, 6.03, 8.07,
12.11, 8.79, 6.67, 7.56, 13.28, 7.23, 4.19, 4.1 , 2.52,
3.62, 6.42, 5.56, 5.94, 4.1 , 2.05, 8.74, 5.68, 4.97,
8.19, 7.78, 3.02, 4.36, 9.39, 12.04, 8.23, 4.83, 2.34,
5.73, 4.34, 9.7 , 10.62, 10.59, 6.43, 7.49, 3.45, 4.1 ,
6.68, 7.8 , 8.69, 5.4 , 11.19, 5.16, 8.09, 13.14, 8.65,
9.43, 5.53, 9.32, 9.62, 7.36, 3.89, 10.31, 12.01, 4.68,
7.82, 8.78, 10. , 6.9 , 5.04, 5.36, 5.05, 9.16, 3.72,
8.31, 5.64, 9.58, 7.71, 4.2 , 8.67, 3.47, 5.12, 7.67,
5.71, 6.37, 7.77, 6.95, 5.31, 9.1 , 5.83, 6.53, 5.01,
11.99, 4.55, 12.98, 10.04, 7.22, 6.67, 6.93, 7.8 , 7.22,
3.42, 2.86, 11.19, 7.74, 5.36, 6.97, 7.6 , 7.53, 6.88,
6.98, 8.75, 9.49, 6.64, 11.82, 11.28, 12.66, 4.21, 8.21,
3.07, 10.98, 9.4 , 8.57, 7.41, 5.28, 10.01, 11.93, 8.03,
4.78, 5.9 , 9.24, 11.18, 9.53, 6.15, 6.8 , 9.33, 7.72,
6.39, 15.63, 6.41, 10.08, 6.97, 5.86, 7.52, 9.16, 10.36,
2.66, 11.7 , 4.69, 6.23, 3.15, 11.27, 4.99, 10.1 , 5.74,
5.87, 7.63, 6.18, 5.17, 8.61, 5.97, 11.54, 7.5 , 7.38,
7.81, 5.99, 8.43, 4.81, 8.97, 6.88, 12.57, 9.32, 8.64,
10.44, 13.44, 9.45, 5.3 , 7.02, 3.58, 13.36, 4.17, 3.13,
8.77, 8.68, 5.25, 10.26, 10.5 , 6.53, 5.98, 14.37, 10.71,
10.26, 7.68, 9.08, 7.8 , 5.58, 9.44, 7.9 , 16.27, 6.81,
6.11, 5.81, 9.64, 3.9 , 4.95, 9.35, 12.85, 5.87, 5.32,
8.67, 8.14, 8.44, 5.47, 6.1 , 4.53, 5.57, 5.35, 12.57,
6.14, 7.41, 5.94, 9.71])
Values of y dataset
avg=y.mean()
print(avg)
7.496325000000001
Calculating the mean (average) of the y values, which is a required parameter for the conversion of continuous
values to categorical values.
maxi=y.max()
print(maxi)
16.27
Calculating the maximum of the y values, which is a required parameter for the conversion of continuous values to
categorical values.
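The cut call itself is not reproduced in the record; a minimal sketch consistent with the output below and the explanation that follows (argument values assumed):
y = pd.cut(y, bins=[0, avg, maxi], labels=['Bad', 'Good'])
print(y)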
['Good', 'Good', 'Good', 'Bad', 'Bad', ..., 'Good', 'Bad', 'Bad', 'Bad', 'Good']
Length: 400
Categories (2, object): ['Bad' < 'Good']
The Pandas cut function is used for converting numerical continuous data into categorical data. Here the values of y
are numerical; we convert them into two categorical values, 'Good' and 'Bad', based on the calculated average and
maximum values of y.
If the value of y lies in the range between 0 and the average value, it is categorized as 'Bad'.
If the value of y lies in the range between the average value and the maximum value, it is categorized as 'Good'.
The bins parameter denotes the bin boundaries for segmentation.
The right parameter denotes whether the rightmost edge of each bin should be included or not. It is of Boolean type
and its default value is True.
The labels parameter defines the labels for the returned segmented bins.
print(y)
['Good', 'Good', 'Good', 'Bad', 'Bad', ..., 'Good', 'Bad', 'Bad', 'Bad', 'Good']
Length: 400
Categories (2, object): ['Bad' < 'Good']
EXPERIMENT -03
Aim : To perform linear regression on the given dataset.
Description :
The term regression is used when you try to find the relationship between variables. In Machine Learning and in
statistical modeling, that relationship is used to predict the outcome of future events. Linear regression uses the
relationship between the data points to draw a straight line through them. It is a statistical method for modeling the
relationship between a dependent variable and a given set of independent variables. It is assumed that the two
variables are linearly related; hence, we try to find a linear function that predicts the response value (y) as accurately
as possible as a function of the feature or independent variable (x).
Dataset Description :
The Carseats data set tracks sales information for car seats. It has 400 observations (each at a different store) & 11
variables:
• CompPrice: price charged by competitor at each location
• Sales : unit sales in thousands
• Income: community income level in 1000s of dollars
• Advertising: local ad budget at each location in 1000s of dollars
• Population: regional pop in thousands
• Price: price for car seats at each site
• ShelveLoc: Bad, Good or Medium indicates quality of shelving location
• Age: age level of the population
• Education: ed level at location
• Urban: Yes/No
• US: Yes/No
Code :
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, RobustScaler
import matplotlib.pyplot as plt
import seaborn as sb
df=pd.read_csv('Carseats.csv')
x=df.iloc[:,1:11].values
y=df.iloc[:,0].values
iloc is a Pandas indexer that selects rows and columns from the data set by position. Here x takes all rows and
columns 1 to 10, and y takes all rows of the first column (index 0), i.e. Sales.
x
array([[138, 73, 11, ..., 17, 'Yes', 'Yes'],
[111, 48, 16, ..., 10, 'Yes', 'Yes'],
[113, 35, 10, ..., 12, 'Yes', 'Yes'],
...,
[162, 26, 12, ..., 18, 'Yes', 'Yes'],
[100, 79, 7, ..., 12, 'Yes', 'Yes'],
[134, 37, 0, ..., 16, 'Yes', 'Yes']], dtype=object)
Values of x dataset
df.corr()
corr() computes the pairwise correlation of the numeric columns of the Pandas DataFrame.
k=df[['Advertising']]
l=df[['Sales']]
print(k)
Advertising
0 11
1 16
2 10
3 4
4 3
.. ...
395 17
396 3
397 12
398 7
399 0
print(l)
Sales
0 9.50
1 11.22
2 10.06
3 7.40
4 4.15
.. ...
395 12.57
396 6.14
397 7.41
398 5.94
399 9.71
X_train,X_test,y_train,y_test=train_test_split(k,l,test_size=0.2,random_state=0)
Dividing the dataset into a training set and a test set using train_test_split(); here 20% of the
data is taken for testing and the remaining 80% for training.
Simple linear regression is a regression algorithm that models the relationship between a dependent variable and a
single independent variable.
It is an approach for predicting a response using a single feature. It is assumed that the two
variables are linearly related. Hence, we try to find a linear function that predicts the response
value(y) as accurately as possible as a function of the feature or independent variable(x).
r=LinearRegression()
r.fit(X_train,y_train)
LinearRegression()
Linear regression uses the relationship between the data points to draw a straight line through them. This line can be
used to predict future values. Here the x and y of the regression are k (Advertising) and l (Sales) respectively.
y_predict=r.predict(X_test)
y_predict
array([[7.67649322],
[8.12173552],
[6.67469806],
[7.3425615 ],
[8.90090953],
[8.78959896],
[6.67469806],
[7.11994035],
[6.67469806],
[7.00862978],
[7.23125093],
[6.67469806],
[7.89911437],
[8.90090953],
[6.89731921],
[7.11994035],
[6.67469806],
[7.7878038 ],
[6.67469806],
[7.23125093],
[7.67649322],
[6.67469806],
[6.67469806],
[6.67469806],
[6.67469806],
[7.00862978],
[6.78600863],
[7.23125093],
[7.89911437],
[6.67469806],
[8.01042494],
[8.45566724],
[7.67649322],
[6.67469806],
[6.67469806],
[7.7878038 ],
[6.67469806],
[7.23125093],
[7.00862978],
[6.67469806],
[7.89911437],
[7.7878038 ],
[6.67469806],
[8.45566724],
[8.12173552],
[8.12173552],
[7.45387208],
[7.56518265],
[6.78600863],
[8.67828839],
[6.67469806],
[6.67469806],
[7.00862978],
[8.01042494],
[7.7878038 ],
[7.7878038 ],
[7.00862978],
plt.scatter(X_test, y_test, color = "red")
plt.plot(X_train, r.predict(X_train), color = "green")
plt.title("Testing_dataset")
plt.xlabel("Advertising")
plt.ylabel("Sales")
plt.show()
print(metrics.mean_squared_error(y_test,y_predict))
6.3927399636374895
# the second printed value equals the square root of the MSE (RMSE); the
# corresponding print statement is presumably missing from the record, e.g.:
print(np.sqrt(metrics.mean_squared_error(y_test,y_predict)))
2.5283868303005947
EXPERIMENT -04
Aim : To perform Multiple Linear Regression for the given dataset.
Description :
Multiple linear regression (MLR), also known simply as multiple regression, is a statistical technique that uses several
explanatory variables to predict the outcome of a response variable. The goal of multiple linear regression is to model
the linear relationship between the explanatory (independent) variables and response (dependent) variables. In essence,
multiple regression is the extension of ordinary least-squares (OLS) regression because it involves more than one
explanatory variable.
Dataset Description :
This particular dataset holds data from 50 startups in New York, California, and Florida. The features in this dataset
are R&D spending, Administration Spending, Marketing Spending, and location features, while the target variable
is: Profit.
1. R&D spending: The amount which startups are spending on Research and development.
2. Administration spending: The amount which startups are spending on the Admin panel.
3. Marketing spending: The amount which startups are spending on marketing strategies.
4. State: To which state that particular startup belongs.
5. Profit: How much profit that particular startup is making
Code :
import numpy as np
import pandas as pd
import math
from sklearn.linear_model import LinearRegression
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
dt=pd.read_csv('50_Startups.csv')
dt.shape
(50, 5)
The shape attribute returns the dimensions (rows, columns) of Pandas and NumPy objects in Python.
plt.scatter(dt['Marketing Spend'], dt['Profit'], alpha=0.5)
plt.title('Scatter plot of Profit with Marketing')
plt.xlabel('Marketing Spend')
plt.ylabel("Profit")
plt.show()
ax = dt.groupby(['State'])['Profit'].mean().plot.bar(
figsize = (10,5), fontsize = 14)
ax.set_title("Average profit for different states where the startups operate", fontsize=20)
ax.set_xlabel("State", fontsize = 15)
ax.set_ylabel("Profit", fontsize = 15)
dt.State.value_counts()
New York 17
California 17
Florida 16
Name: State, dtype: int64
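The cell that one-hot encodes the State column (producing the *_State columns that appear in the feature list below) is not reproduced in the record; a minimal sketch, with pd.get_dummies assumed:
state_dummies = pd.get_dummies(dt['State'])
state_dummies.columns = ['California_State', 'Florida_State', 'NewYork_State']   # names chosen to match the column list below
dt = pd.concat([dt.drop('State', axis=1), state_dummies], axis=1)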
dv='Profit'
iv=dt.columns.tolist()
iv.remove(dv)
iv
['R&D Spend',
 'Administration',
 'Marketing Spend',
 'NewYork_State',
 'California_State',
 'Florida_State']
X = dt[iv].values
y = dt[dv].values
x_train,x_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=0)
sc=MinMaxScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
r = LinearRegression()
r.fit(x_train, y_train)
LinearRegression()
y_pred = r.predict(x_test)
math.sqrt(mean_squared_error(y_test, y_pred))
9137.990152794944
r2_score(y_test, y_pred)
0.934706847328242
EXPERIMENT - 05
Aim : To perform decision tree classification for the given dataset.
Description :
Decision Trees are a type of Supervised Machine Learning (that is you explain what the input is and what the
corresponding output is in the training data) where the data is continuously split according to a certain parameter. The
tree can be explained by two entities, namely decision nodes and leaves.
• Decision tree algorithm falls under the category of supervised learning. They can be used to solve both
regression and classification problems.
• Decision tree uses the tree representation to solve the problem in which each leaf node corresponds to a class
label and attributes are represented on the internal node of the tree.
• We can represent any boolean function on discrete attributes using the decision tree.
Dataset Description :
The data set pizza.csv contains measurements that capture the kind of things that make a pizza tasty. Can you determine
which pizza brand works best for you and explain why? The variables in the data set are:
brand -- Pizza brand (class label)
id -- Sample analysed
mois -- Amount of water per 100 grams in the sample
prot -- Amount of protein per 100 grams in the sample
fat -- Amount of fat per 100 grams in the sample
ash -- Amount of ash per 100 grams in the sample
sodium -- Amount of sodium per 100 grams in the sample
carb -- Amount of carbohydrates per 100 grams in the sample
cal -- Amount of calories per 100 grams in the sample
Code :
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, RobustScaler
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
import seaborn as sb
df=pd.read_csv('Pizza.csv')
x=df.iloc[:,1:8]
y=df.iloc[:,0]
iloc is a Pandas indexer that selects rows and columns from the data set by position. Here x takes all rows and
columns 1 to 7 (iloc[:, 1:8]), and y takes all rows of the first column (index 0), i.e. the brand label.
X_train,X_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=0)
Dividing the dataset into a training set and a test set using train_test_split(); here 20% of the
data is taken for testing and the remaining 80% for training.
sc=StandardScaler()
X_train.iloc[:,:]=sc.fit_transform(X_train.iloc[:,:])
X_test.iloc[:,:]=sc.transform(X_test.iloc[:,:])  # transform only, using the training-set statistics
print(X_train)
[240 rows x 7 columns]
d = DecisionTreeClassifier(criterion='entropy', random_state=0)  # the constructor call is implied by the estimator repr below
d.fit(X_train, y_train)
DecisionTreeClassifier(criterion='entropy', random_state=0)
y_predict=d.predict(X_test)
y_predict
array(['E', 'G', 'A', 'H', 'I', 'E', 'E', 'H', 'B', 'D', 'J', 'A', 'G',
'I', 'A', 'C', 'J', 'D', 'E', 'F', 'H', 'D', 'E', 'E', 'E', 'D',
'F', 'J', 'H', 'J', 'D', 'C', 'A', 'E', 'H', 'F', 'G', 'F', 'J',
'A', 'H', 'D', 'C', 'B', 'B', 'E', 'E', 'C', 'A', 'B', 'C', 'H', 'F', 'D',
'B', 'I', 'F', 'A', 'J', 'F'], dtype=object)
m=metrics.accuracy_score(y_test,y_predict)
print(m*100)
85.0
Calculating the accuracy score using the metrics module.
cm = metrics.confusion_matrix(y_test, y_predict)
print(cm)
[[7 0 0 0 0 0 0 0 0 0]
 [0 5 0 0 0 0 0 0 0 0]
 [0 0 5 0 0 0 0 0 0 0]
 [0 0 0 7 0 0 0 0 0 0]
 [0 0 0 0 5 0 0 0 0 0]
 [0 0 0 0 0 5 1 0 0 0]
 [0 0 0 0 0 2 2 1 0 0]
 [0 0 0 0 5 0 0 6 0 0]
 [0 0 0 0 0 0 0 0 3 0]
 [0 0 0 0 0 0 0 0 0 6]]
EXPERIMENT - 05
Aim : To perform decision tree regression for the given dataset.
Description :
Decision Trees are a type of Supervised Machine Learning (that is you explain what the input is and what the
corresponding output is in the training data) where the data is continuously split according to a certain parameter. The
tree can be explained by two entities, namely decision nodes and leaves.
• Decision tree algorithm falls under the category of supervised learning. They can be used to solve both
regression and classification problems.
• Decision tree uses the tree representation to solve the problem in which each leaf node corresponds to a class
label and attributes are represented on the internal node of the tree.
• We can represent any boolean function on discrete attributes using the decision tree.
Dataset Description :
The data set 'Height_Age_Dataset.csv' contains the heights and ages of different people. The variables in the data set
are:
Height – height of the person in cm
Age – age of the person
Code :
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, RobustScaler
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
import seaborn as sb
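The cell that loads the data is not reproduced in the record; presumably it reads the file named in the dataset description:
data = pd.read_csv('Height_Age_Dataset.csv')
data.head()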
Age Height
0 10 138
1 11 138
2 12 138
3 13 139
4 14 139
#Store the data in the form of dependent and independent variables separately
X = data.iloc[:, 0:1].values
y = data.iloc[:, 1].values
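The record omits the train/test split that produces the X_train and y_train used below; a minimal sketch, with the split ratio and random_state assumed:
#Split the data into training and test sets (parameters assumed; not shown in the record)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)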
#Import the Decision Tree Regressor
from sklearn.tree import DecisionTreeRegressor
#Create the regressor; the constructor call is implied by the estimator repr below
DtReg = DecisionTreeRegressor(random_state=0)
#Fit the decision tree regressor with training data represented by X_train and y_train
DtReg.fit(X_train, y_train)
DecisionTreeRegressor(random_state=0)
''' Visualise the Decision Tree Regression by creating a range of values from the min value of X_train to the max
value of X_train, having a difference of 0.01 between two consecutive values'''
X_val = np.arange(min(X_train), max(X_train), 0.01)
#Reshape the data into a len(X_val)*1 array in order to make a column out of the X_val values
X_val = X_val.reshape((len(X_val), 1))
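The plotting cell itself is not reproduced in the record; a typical visualisation built from X_val, with axis labels assumed from the dataset description, might look like:
#Scatter the original data points and overlay the step-wise tree predictions (sketch; not from the record)
plt.scatter(X, y, color='red')
plt.plot(X_val, DtReg.predict(X_val), color='blue')
plt.title('Decision Tree Regression')
plt.xlabel('Age')
plt.ylabel('Height')
plt.show()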
#Import export_graphviz package
from sklearn.tree import export_graphviz
#Store the decision tree in a tree.dot file in order to visualize the plot.
#Visualize it on http://www.webgraphviz.com/ by copying and pasting the contents of dtregression.dot
export_graphviz(DtReg, out_file ='dtregression.dot',
feature_names =['Age'])
EXPERIMENT -06
Aim : To perform Linear Discriminant Analysis for the given dataset.
Description :
Linear Discriminant Analysis (LDA) is one of the commonly used dimensionality reduction techniques in machine
learning to solve more than two-class classification problems. It is also known as Normal Discriminant Analysis
(NDA) or Discriminant Function Analysis (DFA). It can be used to project the features of a higher-dimensional space
into a lower-dimensional space in order to reduce resource and dimensional costs. In this topic, "Linear Discriminant
Analysis (LDA) in machine learning", we will discuss the LDA algorithm for classification predictive modeling
problems, the limitations of logistic regression, the representation of the linear discriminant analysis model, how to
make a prediction using LDA, how to prepare data for LDA, extensions to LDA, and much more.
Although the logistic regression algorithm is limited to only two-class, linear Discriminant analysis is applicable for
more than two classes of classification problems.
Linear Discriminant analysis is one of the most popular dimensionality reduction techniques used for supervised
classification problems in machine learning. It is also considered a pre-processing step for modeling differences in ML
and applications of pattern classification.
Dataset Description :
The Iris dataset was used in R.A. Fisher's classic 1936 paper, The Use of Multiple Measurements in Taxonomic
Problems, and can also be found on the UCI Machine Learning Repository.
It includes three iris species with 50 samples each as well as some properties about each flower. One flower
species is linearly separable from the other two, but the other two are not linearly separable from each other.
The columns in this dataset are:
• Id
• SepalLengthCm
• SepalWidthCm
• PetalLengthCm
• PetalWidthCm
• Species
Code :
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
cls = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']
dataset = pd.read_csv(url, names=cls)
Specifying the column (attribute) names and reading the dataset from the URL.
dataset.head()
X = dataset.iloc[:, 0:4].values
y = dataset.iloc[:, 4].values
iloc is used to retrieve rows and columns by position; here we divide the dataset into the feature matrix X and the
target variable y.
print(X,y)
sc = StandardScaler()
X = sc.fit_transform(X)
le = LabelEncoder()
y = le.fit_transform(y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
We preprocess the dataset and divide it into train and test sets using the train_test_split method, with 20% of the
data used for testing and the remaining 80% for training.
Since there are no missing values or noise in the data, we directly apply StandardScaler, which rescales the
distribution of values so that the mean of the observed values is 0 and the standard deviation is 1. LabelEncoder is
used to encode the target variable, i.e. the iris species.
The fit_transform() method calculates the mean and standard deviation of each feature and at the same time
transforms the data points of that feature.
print(X,y)
lda = LinearDiscriminantAnalysis(n_components=2)
X_train = lda.fit_transform(X_train, y_train)
X_test = lda.transform(X_test)
plt.scatter(
X_train[:,0],X_train[:,1],c=y_train,cmap='rainbow',
alpha=0.7,edgecolors='b'
)
<matplotlib.collections.PathCollection at 0x7fad177da290>
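The cell that trains a classifier on the LDA-transformed features and produces the accuracy and confusion matrix below is not reproduced in the record; given the RandomForestClassifier and metric imports above, it was presumably something like:
classifier = RandomForestClassifier(random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print('Accuracy : ' + str(accuracy_score(y_test, y_pred)))
print(confusion_matrix(y_test, y_pred))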
Accuracy : 0.9333333333333333
[[ 9  0  0]
 [ 0 10  1]
 [ 0  1  9]]
EXPERIMENT - 07
Aim : To perform support vector machine analysis for the given dataset.
Description :
Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms, which is used for
Classification as well as Regression problems. However, primarily, it is used for Classification problems in Machine
Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space
into classes so that we can easily put the new data point in the correct category in the future. This best decision
boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support
vectors, and hence the algorithm is termed a Support Vector Machine. The objective of the SVM algorithm is to find a
hyperplane in an N-dimensional space that distinctly classifies the data points. The dimension of the hyperplane
depends upon the number of features. If the number of input features is two, then the hyperplane is just a line. If the
number of input features is three, then the hyperplane becomes a 2-D plane. It becomes difficult to imagine when the
number of features exceeds three.
Dataset Description :
• Digits Dataset is a part of sklearn library. Sklearn comes loaded with datasets to practice machine learning techniques
and digits is one of them.
• Digits has 64 numerical features (8×8 pixels) and a 10-class target variable (0-9). The Digits dataset can be used for
classification as well as clustering, and is a first step toward image recognition.
• Each datapoint is a 8x8 image of a digit.
Classes 10
Samples per class ~180
Samples total 1797
Dimensionality 64
Features integers 0-16
Code :
from sklearn.metrics import classification_report
from sklearn import datasets
from skimage import exposure
import numpy as np
import cv2
import imutils
from sklearn.model_selection import train_test_split
mnist=datasets.load_digits()
# the split arguments are cut off in the record; typical values are assumed here
traindata,testdata,trainlabels,testlabels=train_test_split(np.array(mnist.data),mnist.target,test_size=0.25,random_state=42)
traindata,valdata,trainlabels,vallabels=train_test_split(traindata,trainlabels,test_size=0.1,random_state=84)
from sklearn.svm import SVC
model= SVC(C=0.5, kernel= 'linear')
model.fit(traindata, trainlabels)
score= model.score(valdata, vallabels)
print (score*100)
98.51851851851852
predictions = model.predict(testdata)
print("EVALUATION ON TESTING DATA")
report = (classification_report(testlabels, predictions, output_dict=True))
print(report)
EXPERIMENT - 08
Aim : To perform Bayesian Classification for the given dataset.
Description :
Bayesian classification is a probabilistic approach to learning and inference based on a different view of what it means
to learn from data, in which probability is used to represent uncertainty about the relationship being learnt. In numerous
applications, the connection between the attribute set and the class variable is non-deterministic. In other words, the
class label of a test record cannot be assumed with certainty even though its attribute set is the same as that of some of
the training examples. These circumstances may emerge due to noisy data or the presence of certain confounding
factors that influence classification but are not included in the analysis. Bayesian classification uses Bayes' theorem to
predict the occurrence of any event. Bayesian classifiers are the statistical classifiers with the Bayesian probability
understandings.
Dataset Description :
Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe
characteristics of the cell nuclei present in the image. Attribute Information:
a) radius (mean of distances from center to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter^2 / area - 1.0)
g) concavity (severity of concave portions of the contour)
h) concave points (number of concave portions of the contour)
i) symmetry
j) fractal dimension ("coastline approximation" - 1)
The mean, standard error and "worst" or largest (mean of the three largest values) of these features were computed for each
image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.
Code :
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("darkgrid")
data= pd.read_csv("cancer_data.csv")
data.head(10)
mean_radius mean_texture mean_perimeter mean_area mean_smoothness diagnosis
0 17.99 10.38 122.80 1001.0 0.11840 0
1 20.57 17.77 132.90 1326.0 0.08474 0
2 19.69 21.25 130.00 1203.0 0.10960 0
3 11.42 20.38 77.58 386.1 0.14250 0
4 20.29 14.34 135.10 1297.0 0.10030 0
5 12.45 15.70 82.57 477.1 0.12780 0
6 18.25 19.98 119.60 1040.0 0.09463 0
7 13.71 20.83 90.20 577.9 0.11890 0
8 13.00 21.82 87.50 519.8 0.12730 0
9 12.46 24.04 83.97 475.9 0.11860 0
data["diagnosis"].hist()
<matplotlib.axes._subplots.AxesSubplot at 0x7fb834219c50>
corr= data.iloc[:,:-1].corr(method="pearson")
cmap = sns.diverging_palette(250,354,80,60,center='dark',as_cmap=True)
sns.heatmap(corr, vmax=1, vmin=-.5, cmap=cmap, square=True, linewidths=.2)
<matplotlib.axes._subplots.AxesSubplot at 0x7fb834159150>
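The cell that reduces the data to the three features shown below is not reproduced in the record; presumably something like:
data = data[['mean_radius', 'mean_texture', 'mean_smoothness', 'diagnosis']]
data.head(10)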
mean_radius mean_texture mean_smoothness diagnosis
0 17.99 10.38 0.11840 0
1 20.57 17.77 0.08474 0
2 19.69 21.25 0.10960 0
3 11.42 20.38 0.14250 0
4 20.29 14.34 0.10030 0
5 12.45 15.70 0.12780 0
6 18.25 19.98 0.09463 0
7 13.71 20.83 0.11890 0
8 13.00 21.82 0.12730 0
9 12.46 24.04 0.11860 0
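The cell that bins the continuous features into the cat_ columns shown below is also not reproduced; a minimal sketch using three equal-width bins (bin counts and labels assumed, chosen to be consistent with the table):
data['cat_mean_radius'] = pd.cut(data['mean_radius'], bins=3, labels=[0, 1, 2])
data['cat_mean_texture'] = pd.cut(data['mean_texture'], bins=3, labels=[0, 1, 2])
data['cat_mean_smoothness'] = pd.cut(data['mean_smoothness'], bins=3, labels=[0, 1, 2])
df = data[['cat_mean_radius', 'cat_mean_texture', 'cat_mean_smoothness', 'diagnosis']]
df.head(10)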
   cat_mean_radius cat_mean_texture cat_mean_smoothness diagnosis
0                1                0                   1         0
1                1                0                   0         0
2                1                1                   1         0
3                0                1                   2         0
4                1                0                   1         0
5                0                0                   2         0
6                1                1                   1         0
7                0                1                   1         0
8                0                1                   2         0
9                0                1                   1         0
def calculate_likelihood_categorical(df, feat_name, feat_val, Y, label):
    feat = list(df.columns)
    df = df[df[Y] == label]
    p_x_given_y = len(df[df[feat_name] == feat_val]) / len(df)
    return p_x_given_y
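The classifier below calls calculate_prior(), whose definition is mostly missing from the record (only a prior=[] fragment survives); a minimal sketch consistent with its usage:
# prior probability of each class: fraction of training rows with that label
def calculate_prior(df, Y):
    classes = sorted(list(df[Y].unique()))
    prior = []
    for c in classes:
        prior.append(len(df[df[Y] == c]) / len(df))
    return prior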
# Naive Bayes classifier for categorical features; the def line and the posterior
# step are missing from the record and are reconstructed to match the call below
def naive_bayes_categorical(df, X, Y):
    # feature names: all columns except the target
    features = list(df.columns)[:-1]
    # calculate prior
    prior = calculate_prior(df, Y)
    Y_pred = []
    # loop over every data sample
    for x in X:
        # calculate likelihood
        labels = sorted(list(df[Y].unique()))
        likelihood = [1]*len(labels)
        for j in range(len(labels)):
            for i in range(len(features)):
                likelihood[j] *= calculate_likelihood_categorical(df, features[i], x[i], Y, labels[j])
        # calculate posterior probability (numerator only)
        post_prob = [1]*len(labels)
        for j in range(len(labels)):
            post_prob[j] = likelihood[j] * prior[j]
        Y_pred.append(np.argmax(post_prob))
    return np.array(Y_pred)
# train and test come from a split of the categorical dataframe performed in an earlier cell not reproduced here
X_test = test.iloc[:,:-1].values
Y_test = test.iloc[:,-1].values
Y_pred = naive_bayes_categorical(train, X=X_test, Y="diagnosis")
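The evaluation cell is not reproduced in the record; the two printed values below are consistent with a confusion matrix and an F1 score (138/145 ≈ 0.9517), so presumably something like:
from sklearn.metrics import confusion_matrix, f1_score
print(confusion_matrix(Y_test, Y_pred))
print(f1_score(Y_test, Y_pred))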
[[38 2]
[ 5 69]]
0.9517241379310345
EXPERIMENT -09
Aim : To perform perceptron learning of the neural network for the given dataset.
Description :
It is a machine learning algorithm that uses supervised learning of binary classifiers. In Perceptron, the weight coefficient
is automatically learned. Initially, weights are multiplied with input features, and then the decision is made whether the
neuron is fired or not. A neural network is formed when a collection of nodes or neurons are interlinked through synaptic
connections. There are three layers in every artificial neural network – input layer, hidden layer, and output layer. The
input layer that is formed from a collection of several nodes or neurons receives inputs. Every neuron in the network has
a function, and every connection has a weight value associated with it. Inputs then move from the input layer to a layer
made from a separate set of neurons – the hidden layer. The output layer gives the final outputs.
The concept of perceptron has a critical role in machine learning. It is used as an algorithm or a linear classifier to
facilitate supervised learning of binary classifiers. Supervised learning is amongst the most researched of learning
problems. A supervised learning sample always consists of an input and a correct/explicit output. The objective of this
learning problem is to use data with correct labels for making predictions on future data, for training a model. Some of
the common problems of supervised learning include classification to predict class labels.
Dataset Description :
The seeds dataset involves the prediction of species given measurements of seeds from different varieties of wheat.
There are 201 records and 7 numerical input variables. It is a classification problem with 3 output classes. The scale of
each numeric input value varies, so some data normalization may be required for use with algorithms that weight inputs,
like the backpropagation algorithm. Using the Zero Rule algorithm that predicts the most common class value, the
baseline accuracy for the problem is 28.095%.
Code :
from math import exp
from random import seed
from random import random
A function named initialize_network() creates a new neural network ready for training. It accepts three parameters:
the number of inputs, the number of neurons in the hidden layer, and the number of outputs.
# Initialize a network
def initialize_network(n_inputs, n_hidden, n_outputs):
    network = list()
    hidden_layer = [{'weights':[random() for i in range(n_inputs + 1)]} for i in range(n_hidden)]
    network.append(hidden_layer)
    output_layer = [{'weights':[random() for i in range(n_hidden + 1)]} for i in range(n_outputs)]
    network.append(output_layer)
    return network
seed(1)
network = initialize_network(2, 1, 2)
for layer in network:
    print(layer)
The function named activate() calculates the neuron activation as the weighted sum of the inputs. You can see that the
function assumes that the bias is the last weight in the list of weights.
# Calculate neuron activation for an input
def activate(weights, inputs):
    activation = weights[-1]
    for i in range(len(weights)-1):
        activation += weights[i] * inputs[i]
    return activation
A function named transfer() implements the sigmoid equation. The sigmoid activation function looks like an S shape;
it is also called the logistic function. It can take any input value and produce a number between 0 and 1 on an S-curve.
# Transfer neuron activation
def transfer(activation):
    return 1.0 / (1.0 + exp(-activation))
A function named forward_propagate() implements the forward propagation for a row of data from our dataset
through our neural network. The function returns the outputs from the last layer, also called the output layer.
# Forward propagate input to a network output
def forward_propagate(network, row):
    inputs = row
    for layer in network:
        new_inputs = []
        for neuron in layer:
            activation = activate(neuron['weights'], inputs)
            neuron['output'] = transfer(activation)
            new_inputs.append(neuron['output'])
        inputs = new_inputs
    return inputs
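The cell that calls forward_propagate() to produce the output below is not shown; a minimal sketch (the dummy input row is an assumption):
row = [1, 0, None]   # two inputs plus a placeholder for the class value
output = forward_propagate(network, row)
print(output)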
[0.6629970129852887, 0.7253160725279748]
EXPERIMENT -10
Aim : To perform backpropagation of the neural network for the given dataset.
Description :
The Backpropagation algorithm is a supervised learning method for multilayer feed-forward networks from the field of
Artificial Neural Networks.
Feed-forward neural networks are inspired by the information processing of one or more neural cells, called a neuron.
A neuron accepts input signals via its dendrites, which pass the electrical signal down to the cell body. The axon
carries the signal out to synapses, which are the connections of a cell’s axon to other cell’s dendrites.
The principle of the backpropagation approach is to model a given function by modifying internal weightings of input
signals to produce an expected output signal. The system is trained using a supervised learning method, where the
error between the system’s output and a known expected output is presented to the system and used to modify its
internal state.
Technically, the backpropagation algorithm is a method for training the weights in a multilayer feed-forward neural
network. As such, it requires a network structure to be defined of one or more layers where one layer is fully
connected to the next layer. A standard network structure is one input layer, one hidden layer, and one output layer.
Dataset Description :
The seeds dataset involves the prediction of species given measurements of seeds from different varieties of wheat.
There are 201 records and 7 numerical input variables. It is a classification problem with 3 output classes. The scale of
each numeric input value varies, so some data normalization may be required for use with algorithms that weight inputs
like the backpropagation algorithm. Using the Zero Rule algorithm that predicts the most common class value, the
baseline accuracy for the problem is 28.095%.
Code :
from math import exp
from random import seed
from random import random
# Initialize a network
def initialize_network(n_inputs, n_hidden, n_outputs):
    network = list()
    hidden_layer = [{'weights':[random() for i in range(n_inputs + 1)]} for i in range(n_hidden)]
    network.append(hidden_layer)
    output_layer = [{'weights':[random() for i in range(n_hidden + 1)]} for i in range(n_outputs)]
    network.append(output_layer)
    return network
# Transfer neuron activation
def transfer(activation):
    return 1.0 / (1.0 + exp(-activation))
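train_network() below also calls backward_propagate_error() and update_weights(), whose definitions are not reproduced in the record (forward_propagate() is assumed from Experiment 9); a sketch of the standard formulation:
# Calculate the derivative of a neuron's sigmoid output
def transfer_derivative(output):
    return output * (1.0 - output)

# Backpropagate error and store a delta in each neuron
def backward_propagate_error(network, expected):
    for i in reversed(range(len(network))):
        layer = network[i]
        errors = list()
        if i != len(network)-1:
            # hidden layer: error is the weighted sum of the deltas in the next layer
            for j in range(len(layer)):
                error = 0.0
                for neuron in network[i + 1]:
                    error += (neuron['weights'][j] * neuron['delta'])
                errors.append(error)
        else:
            # output layer: error is the difference between prediction and target
            for j in range(len(layer)):
                neuron = layer[j]
                errors.append(neuron['output'] - expected[j])
        for j in range(len(layer)):
            neuron = layer[j]
            neuron['delta'] = errors[j] * transfer_derivative(neuron['output'])

# Update network weights with error (gradient-descent step)
def update_weights(network, row, l_rate):
    for i in range(len(network)):
        inputs = row[:-1]
        if i != 0:
            inputs = [neuron['output'] for neuron in network[i - 1]]
        for neuron in network[i]:
            for j in range(len(inputs)):
                neuron['weights'][j] -= l_rate * neuron['delta'] * inputs[j]
            neuron['weights'][-1] -= l_rate * neuron['delta']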
# Train a network for a fixed number of epochs
def train_network(network, train, l_rate, n_epoch, n_outputs):
    for epoch in range(n_epoch):
        sum_error = 0
        for row in train:
            outputs = forward_propagate(network, row)
            expected = [0 for i in range(n_outputs)]
            expected[row[-1]] = 1
            sum_error += sum([(expected[i]-outputs[i])**2 for i in range(len(expected))])
            backward_propagate_error(network, expected)
            update_weights(network, row, l_rate)
        print('>epoch=%d, lrate=%.3f, error=%.3f' % (epoch, l_rate, sum_error))
CASE STUDY :
RANDOM FORESTS
AIM:
The main objective is to build a model which predicts whether a person has diabetes or not, based on
parameters such as glucose level, insulin level, etc., using a Random Forest.
CONTENTS:
Random Forest
Random Forest is a popular machine learning algorithm that belongs to
the supervised learning technique. It builds a number of decision trees on various
subsets of the given dataset and averages them to improve the predictive accuracy
on that dataset. Instead of relying on one decision tree, the random forest takes the
prediction from each tree and, based on the majority vote of those predictions, it predicts
the final output.
Dataset Description:
The dataset is gathered from Kaggle, which is named the Pima Indian Diabetes Dataset(PIDD).
The dataset has many attributes of 768 patients.
The 9th attribute is the class variable of each data point. This class variable is the outcome, 0 or 1, indicating
negative or positive for diabetes.
Distribution of diabetic patient records: we built a model to predict diabetes; however, the dataset was slightly
imbalanced, with around 500 records labeled as 0 (negative, no diabetes) and 268 labeled as 1 (positive, diabetic).
PROCEDURE:
Flowchart
Feature selection
Pearson Correlation is the default method of the function “corr”.
We can also create a heatmap using Seaborn to visualize the correlation between the
different columns of our data:
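A minimal sketch of such a heatmap (styling assumed; not taken from the record):
import seaborn as sns
import matplotlib.pyplot as plt

corr = df.corr()                               # Pearson correlation by default
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()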
Normalization
CODE SNIPPETS:
Deployment of Model (deployment.py):
# Importing essential libraries
import numpy as np
import pandas as pd
import pickle
import streamlit as st                        # used by the Streamlit UI calls below
from sklearn.metrics import accuracy_score   # used to report accuracy at the end
# Model Building
from sklearn.model_selection import train_test_split
X = df.drop(columns='Outcome')
y = df['Outcome']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=20)
classifier.fit(X_train, y_train)
df = pd.read_csv(r'diabetes.csv')
print(df.shape)
# HEADINGS
st.title('Diabetes Checkup')
st.sidebar.header('Patient Data')
st.subheader('Training Data Statistics')
st.write(df.describe())
# X AND Y DATA
x = df.drop(['Outcome'], axis = 1)
y = df.iloc[:, -1]
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size = 0.2, random_state = 0)
# FUNCTION
def user_report():
pregnancies = st.sidebar.slider('Pregnancies', 0,17, 3 )
glucose = st.sidebar.slider('Glucose', 0,200, 120 )
insulin = st.sidebar.slider('Insulin', 0,846, 79 )
bmi = st.sidebar.slider('BMI', 0,67, 20 )
age = st.sidebar.slider('Age', 21,88, 33 )
user_report_data = {
'pregnancies':pregnancies,
'glucose':glucose,
'insulin':insulin,
'bmi':bmi,
'age':age
}
data = np.array([[glucose, bp, insulin, bmi]])   # 'bp' presumably comes from a blood-pressure slider omitted in this record
report_data = pd.DataFrame(user_report_data, index=[0])
temp=[data, report_data]
return temp
# PATIENT DATA
samp = user_report()
forpkl=samp[0]
user_data=samp[1]
st.subheader('Patient Data')
st.write(user_data)
# MODEL
filename = 'diabetes-prediction-rfc-model.pkl'
rf=pickle.load(open(filename, 'rb'))
user_res = rf.predict(forpkl)
# OUTPUT
st.subheader('Your Report: ')
output=' '
if user_res[0]==0:
output = 'You are not Diabetic'
else:
output = 'You are Diabetic'
st.title(output)
st.subheader('Accuracy: ')
st.write(str(accuracy_score(y_test, rf.predict(x_test))*100)+'%')
CONCLUSION: