machinelearning
machinelearning
0
Ex. No. 1 Implement a Linear Real dataset. Experiment with
different features in building a model. Tune the model’s hyperparameters
Date:
Aim:
To implement a Linear Real dataset.To experiment with different features in
building a model. To tune the model’s hyperparameters.
Program:
import numpy as np
import matplotlib.pyplot as plt
1
# putting labels
plt.xlabel('x')
plt.ylabel('y')
def main():
# observations / data
x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])
# estimating coefficients
b = estimate_coef(x, y)
print("Estimated coefficients:\nb_0 = {} \
\nb_1 = {}".format(b[0], b[1]))
if __name__ == "__main__":
main()
Output:
Estimated coefficients:
b_0 = -0.0586206896552
b_1 = 1.45747126437
2
And graph obtained looks like this:
Result:
Thus the experiment was successfully executed.
3
Ex. No. 2 Implement a binary classification model. Modify the
classification threshold and determine how the modify influence the model.
Experiment with different classification metrics to determine model
effectiveness.
Date:
Aim:
To implement a binary classification model. To modify the classification
threshold and determine how the modify influence the model. Experiment with different
classification metrics to determine model effectiveness.
Program:
# baseline model
def create_baseline():
# create model
model = Sequential()
model.add(Dense(60, input_shape=(60,), activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# Compile model
model.compile(loss='binary_crossentropy', optimizer='adam',
metrics=['accuracy'])
return model
...
# evaluate model with standardized dataset
estimator = KerasClassifier(model=create_baseline, epochs=100, batch_size=5,
verbose=0)
kfold = StratifiedKFold(n_splits=10, shuffle=True)
results = cross_val_score(estimator, X, encoded_Y, cv=kfold)
print("Baseline: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))
# Binary Classification with Sonar Dataset: Baseline
from pandas import read_csv
from tensorflow.keras.models import Sequential
4
from tensorflow.keras.layers import Dense
from scikeras.wrappers import KerasClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold
# load dataset
dataframe = read_csv("sonar.csv", header=None)
dataset = dataframe.values
# split into input (X) and output (Y) variables
X = dataset[:,0:60].astype(float)
Y = dataset[:,60]
# encode class values as integers
encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)
# baseline model
def create_baseline():
# create model
model = Sequential()
model.add(Dense(60, input_shape=(60,), activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# Compile model
model.compile(loss='binary_crossentropy', optimizer='adam',
metrics=['accuracy'])
return model
# evaluate model with standardized dataset
estimator = KerasClassifier(model=create_baseline, epochs=100, batch_size=5,
verbose=0)
kfold = StratifiedKFold(n_splits=10, shuffle=True)
results = cross_val_score(estimator, X, encoded_Y, cv=kfold)
print("Baseline: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))
5
Output:
Baseline: 81.68% (7.26%)
Result:
Thus the experiment was successfully executed.
6
Ex. No. 3 Classification with nearest neighbors. In this question, you will
use the scikit-learn’s KNN classifier to classify real vs. fake news headlines.
The aim of this question is for you to read the scikit-learn API an get
comfortable with training/validation splits. Use California Housing dataset.
Date:
Aim:
To classify with nearest neighbors. The aim of this question is to read the
scikit-learn API an get comfortable with training/validation splits using California Housing
dataset.
Steps:
In [1]:
#importing libreries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
In [2]:
#importing dataset
ch = pd.read_csv('../input/california-housing/housing.csv')
ch.head()
Out[2]:
long lati housing_ total_ total_b popu hous median median_h ocean_
itud tud median_a room edroo latio ehol _incom ouse_valu proximi
e e ge s ms n ds e e ty
-
37. 322. 126. NEAR
0 122. 41.0 880.0 129.0 8.3252 452600.0
88 0 0 BAY
23
7
long lati housing_ total_ total_b popu hous median median_h ocean_
itud tud median_a room edroo latio ehol _incom ouse_valu proximi
e e ge s ms n ds e e ty
-
37. 7099. 2401 1138 NEAR
1 122. 21.0 1106.0 8.3014 358500.0
86 0 .0 .0 BAY
22
-
37. 1467. 496. 177. NEAR
2 122. 52.0 190.0 7.2574 352100.0
85 0 0 0 BAY
24
-
37. 1274. 558. 219. NEAR
3 122. 52.0 235.0 5.6431 341300.0
85 0 0 0 BAY
25
-
37. 1627. 565. 259. NEAR
4 122. 52.0 280.0 3.8462 342200.0
85 0 0 0 BAY
25
This dataset was derived from the 1990 U.S. census, using one row per census block group. A
block group is the smallest geographical unit for which the U.S. Census Bureau publishes
sample data (a block group typically has a population of 600 to 3,000 people).
We have:
In [3]:
8
ch.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 longitude 20640 non-null float64
1 latitude 20640 non-null float64
2 housing_median_age 20640 non-null float64
3 total_rooms 20640 non-null float64
4 total_bedrooms 20433 non-null float64
5 population 20640 non-null float64
6 households 20640 non-null float64
7 median_income 20640 non-null float64
8 median_house_value 20640 non-null float64
9 ocean_proximity 20640 non-null object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
All columns are numerical except fot 'ocean_proximity'
'total_bedrooms' have some null values (we will fix both)
We are going to switch all of null values with the mean for a better prediction.
In [4]:
#fixing data null with mean
ch = ch.fillna(ch.mean())
ch.isna().sum()
Out[4]:
longitude 0
latitude 0
housing_median_age 0
total_rooms 0
total_bedrooms 0
population 0
households 0
median_income 0
median_house_value 0
ocean_proximity 0
dtype: int64
What you could do is choose whether to delete the lines with the null value or replace the
value. What I did was to replace the null value with an average because I believe that
eliminating too many lines can affect our analysis and forecast.
In [5]:
ch.describe()
Out[5]:
9
housing_ total_b median median_h
longit latitud total_r popul house
median_a edroo _incom ouse_val
ude e ooms ation holds
ge ms e ue
m - 2635. 1425.
35.63 28.63948 537.87 499.5 3.8706 206855.8
ea 119.5 76308 47674
1861 6 0553 39680 71 16909
n 69704 1 4
2181. 1132.
st 2.003 2.135 12.58555 419.26 382.3 1.8998 115395.6
61525 46212
d 532 952 8 6592 29753 22 15874
2 2
-
m 32.54 2.000 1.0000 3.000 1.000 0.4999 14999.00
124.3 1.000000
in 0000 000 00 000 000 00 0000
50000
- 1447.
25 33.93 18.00000 297.00 787.0 280.0 2.5634 119600.0
121.8 75000
% 0000 0 0000 00000 00000 00 00000
00000 0
- 2127. 1166.
50 34.26 29.00000 438.00 409.0 3.5348 179700.0
118.4 00000 00000
% 0000 0 0000 00000 00 00000
90000 0 0
- 3148. 1725.
75 37.71 37.00000 643.25 605.0 4.7432 264725.0
118.0 00000 00000
% 0000 0 0000 00000 50 00000
10000 0 0
10
housing_ total_b median median_h
longit latitud total_r popul house
median_a edroo _incom ouse_val
ude e ooms ation holds
ge ms e ue
In [6]:
ch.hist(bins=50, figsize=(10, 10)) #visualizing an histogram
Out[6]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7fec1ecdf750>,
<matplotlib.axes._subplots.AxesSubplot object at 0x7fec1eae0350>,
<matplotlib.axes._subplots.AxesSubplot object at 0x7fec1eb179d0>],
[<matplotlib.axes._subplots.AxesSubplot object at 0x7fec1eac3b90>,
<matplotlib.axes._subplots.AxesSubplot object at 0x7fec1ea8f710>,
<matplotlib.axes._subplots.AxesSubplot object at 0x7fec1ea44d90>],
[<matplotlib.axes._subplots.AxesSubplot object at 0x7fec1ea054d0>,
<matplotlib.axes._subplots.AxesSubplot object at 0x7fec1e9bba90>,
<matplotlib.axes._subplots.AxesSubplot object at 0x7fec1e9bbad0>]],
dtype=object)
Result:
Thus the experiment was successfully executed.
Ex. No. 4 Experiment with validation sets an test sets using the
dataset. Split a training set into a smaller training set and validation set.
Analyse deltas between training set an validation set results. Test the
trained model is overfitting.
Date:
Aim:
11
Experiment with validation sets an test sets using the dataset. To split a
training set into a smaller training set and validation set. To analyse deltas between training
set an validation set results. To test the trained model is overfitting.
Steps:
Training Set
This is the actual dataset from which a model trains .i.e. the model sees and learns from this
data to predict the outcome or to make the right decisions. Most of the training data is
collected from several resources and then preprocessed and organized to provide proper
performance of the model. Type of training data hugely determines the ability of the model
to generalize .i.e. the better the quality and diversity of training data, the better will be the
performance of the model. This data is more than 60% of the total data available for the
project.
Program:
# Importing numpy & scikit-learn
import numpy as np
from sklearn.model_selection import train_test_split
train_size=0.8,
12
random_state=42)
# Training set
print("Training set x: ",x_train)
print("Training set y: ",y_train)
Output:
Training set x: [[ 0 1]
[14 15]
[ 4 5]
[ 8 9]
[ 6 7]
[12 13]]
Training set y: [0, 7, 2, 4, 3, 6]
Testing Set
This dataset is independent of the training set but has a somewhat similar type of
probability distribution of classes and is used as a benchmark to evaluate the model, used
only after the training of the model is complete. Testing set is usually a properly organized
dataset having all kinds of data for scenarios that the model would probably be facing when
used in the real world. Often the validation and testing set combined is used as a testing set
which is not considered a good practice. If the accuracy of the model on training data is
greater than that on testing data then the model is said to have overfitting. This data is
approximately 20-25% of the total data available for the project.
Program:
# Importing numpy & scikit-learn
import numpy as np
from sklearn.model_selection import train_test_split
13
# y is just a list of 0-7 number representing
# target variable
y = range(8)
test_size=0.2,
random_state=42)
# Testing set
print("Testing set x: ", x_test)
print("Testing set y: ", y_test)
Output:
Testing set x: [[ 2 3]
[10 11]]
Testing set y: [1, 5]
Result:
Thus the experiment was successfully implemented.
Aim:
14
To Implement the k-means algorithm using
https://archive.ics.uci.eu/ml/atasets/Codon+usage dataset.
Steps:
import numpy as np
fig = plt.figure(0)
plt.grid(True)
plt.scatter(X[:,0],X[:,1])
plt.show()
k=3
clusters = {}
np.random.seed(23)
15
for idx in range(k):
center = 2*(2*np.random.random((X.shape[1],))-1)
points = []
cluster = {
'center' : center,
'points' : []
clusters[idx] = cluster
clusters
plt.scatter(X[:,0],X[:,1])
plt.grid(True)
for i in clusters:
center = clusters[i]['center']
plt.show()
Output:
Aim:
17
To implement the Naïve Bayes Classifier using
https://archive.ics.uci.eu/ml/datasets/Gait+Classification dataset.
Abstract: Gait is considered a biometric criterion. Therefore, we tried to classify people with
gait analysis with this gait data set.
Data Set Number of
Multivariate 48 Area: Computer
Characteristics: Instances:
Missing Number of
Associated Tasks: Classification N/A 38151
Values? Web Hits:
Data Set Information:
This data set was created by calculating the walking parameters of a total of 16 different
volunteers, 7 female and 9 male. The volunteers of 16 volunteers ranged between 20 and 34
years old, and their weight ranged from 53 to 95. In order to calculate each walking
parameter, people were asked to walk the 30-meter long course for three rounds. The shared
file contains X and Y data. X represents gait data and y represents person information for the
relevant sample.
Implementation of the Naïve Bayes algorithm:
Now we will implement a Naive Bayes Algorithm using Python. So for this, we will use the
"user_data" dataset, which we have used in our other classification model. Therefore we can
easily compare the Naive Bayes model with the other models.
Steps to implement:
o Data Pre-processing step
o Fitting Naive Bayes to the Training set
o Predicting the test result
o Test accuracy of the result(Creation of Confusion matrix)
o Visualizing the test set result.
1) Data Pre-processing step:
In this step, we will pre-process/prepare the data so that we can use it efficiently in our code.
It is similar as we did in data-pre-processing. The code for this is given below:
1. Importing the libraries
2. import numpy as nm
3. import matplotlib.pyplot as mtp
4. import pandas as pd
5.
6. # Importing the dataset
18
7. dataset = pd.read_csv('user_data.csv')
8. x = dataset.iloc[:, [2, 3]].values
9. y = dataset.iloc[:, 4].values
10.
11. # Splitting the dataset into the Training set and Test set
12. from sklearn.model_selection import train_test_split
13. x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 0)
14.
15. # Feature Scaling
16. from sklearn.preprocessing import StandardScaler
17. sc = StandardScaler()
18. x_train = sc.fit_transform(x_train)
19. x_test = sc.transform(x_test)
In the above code, we have loaded the dataset into our program using "dataset =
pd.read_csv('user_data.csv'). The loaded dataset is divided into training and test set, and
then we have scaled the feature variable.
The output for the dataset is given as:
19
2) Fitting Naive Bayes to the Training Set:
After the pre-processing step, now we will fit the Naive Bayes model to the Training set.
Below is the code for it:
1. # Fitting Naive Bayes to the Training set
20
2. from sklearn.naive_bayes import GaussianNB
3. classifier = GaussianNB()
4. classifier.fit(x_train, y_train)
In the above code, we have used the GaussianNB classifier to fit it to the training dataset.
We can also use other classifiers as per our requirement.
Output:
Out[6]: GaussianNB(priors=None, var_smoothing=1e-09)
3) Prediction of the test set result:
Now we will predict the test set result. For this, we will create a new predictor
variable y_pred, and will use the predict function to make the predictions.
1. # Predicting the Test set results
2. y_pred = classifier.predict(x_test)
Output:
21
The above output shows the result for prediction vector y_pred and real vector y_test. We
can see that some predications are different from the real values, which are the incorrect
predictions.
4) Creating Confusion Matrix:
Now we will check the accuracy of the Naive Bayes classifier using the Confusion matrix.
Below is the code for it:
1. # Making the Confusion Matrix
2. from sklearn.metrics import confusion_matrix
3. cm = confusion_matrix(y_test, y_pred)
22
Output:
As we can see in the above confusion matrix output, there are 7+3= 10 incorrect predictions,
and 65+25=90 correct predictions.
5) Visualizing the training set result:
Next we will visualize the training set result using Naïve Bayes Classifier. Below is the code
for it:
1. # Visualising the Training set results
2. from matplotlib.colors import ListedColormap
3. x_set, y_set = x_train, y_train
4. X1, X2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, st
ep = 0.01),
5. nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
23
16. mtp.legend()
17. mtp.show()
Output:
In the above output we can see that the Naïve Bayes classifier has segregated the data points
with the fine boundary. It is Gaussian curve as we have used GaussianNB classifier in our
code.
6) Visualizing the Test set result:
1. # Visualising the Test set results
2. from matplotlib.colors import ListedColormap
3. x_set, y_set = x_test, y_test
4. X1, X2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, st
ep = 0.01),
5. nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
24
17. mtp.show()
Output:
Result:
25