ML Lab File
INDEX
S.No. Exercise
1 ML_Exercise_1
2 ML_Exercise_2
3 ML_Exercise_3
4 ML_Exercise_4
5 ML_Exercise_5
6 ML_Exercise_6
Exercise 1
1. Using the head() function, print the raw values of the data, i.e., the top n rows.
2. Using the tail() function, print the last n rows of the data.
3. Check the dimensionality of the data by using the shape attribute.
4. Get each attribute's data type by using the dtypes property.
5. Find the statistical summary of the data with the help of the describe() method.
In [4]:
import pandas as pd
In [5]:
data = pd.read_csv('/Users/mohitchoudhary/Desktop/train.csv')
In [6]:
data.head(5)
Out[6]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
[remaining rows of the head(5) output not preserved in this export]
In [7]:
data.tail(5)
Out[7]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.00 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.00 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.75 NaN Q
In [8]:
data.shape
Out[8]:
(891, 12)
In [9]:
data.dtypes
Out[9]:
PassengerId int64
Survived int64
Pclass int64
Name object
Sex object
Age float64
SibSp int64
Parch int64
Ticket object
Fare float64
Cabin object
Embarked object
dtype: object
In [10]:
data.describe()
Out[10]:
[statistical summary table: count, mean, std, min, 25%, 50%, 75% and max for each numeric column]
Exercise 2
1. Plot histograms for the dataset using the hist() function.
2. Plot density plots for the dataset to understand the attribute distributions.
3. Plot box and whisker plots for the dataset to understand the attribute distributions.
4. Plot multivariate plots (correlation matrix plot and scatter matrix plot) for the dataset.
In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
In [2]:
data = pd.read_csv('/Users/mohitchoudhary/Desktop/train.csv')
In [3]:
data.head()
Out[3]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
[remaining rows of the head() output not preserved in this export]
In [4]:
data.plot(kind = 'hist', subplots = True, layout = (3,3))
plt.show()
/Users/mohitchoudhary/anaconda3/lib/python3.8/site-packages/pandas/plotting/_matplotlib/tools.py:307, 313: MatplotlibDeprecationWarning: the rowNum and colNum attributes were deprecated in Matplotlib 3.2 and will be removed two minor releases later; use ax.get_subplotspec().rowspan.start / ax.get_subplotspec().colspan.start instead.
In [5]:
data.plot(kind = 'density', subplots = True, layout = (3,3))
plt.show()
(The same MatplotlibDeprecationWarning messages about rowNum/colNum are emitted again for this plot.)
In [6]:
data.boxplot(figsize = (10,10))
plt.show()
In [7]:
corr = data.corr()
plt.figure(figsize=(12,8))
sns.heatmap(corr, cmap="Greens",annot=True)
Out[7]:
<AxesSubplot:>
In [8]:
sns.pairplot(data=data)
plt.show()
Exercise 3
Ridge Regression
In a multiple linear regression (LR), there are many variables at play. This sometimes poses the problem of choosing the wrong variables for the model, which gives undesirable output as a result. Ridge regression is used to overcome this. It is a regularisation technique in which an extra tuning parameter is added and optimised to offset the effect of the many variables in the LR (in the statistical context, this effect is referred to as "noise").
Ridge regression is essentially an instance of LR with regularisation. Mathematically, the model is given by
Y = XB + e
where Y is the dependent variable (label), X holds the independent variables (features), B represents all the regression coefficients and e represents the residuals. Before fitting, the variables are standardised by subtracting their respective means and dividing by their standard deviations.
The tuning parameter, denoted λ, is then included in the ridge regression model as part of the regularisation. The higher the value of λ, the more strongly the coefficients are shrunk towards zero; the lower the value of λ, the closer the solution is to the ordinary least-squares fit. In simpler words, this parameter decides how heavily the coefficients are penalised. λ is found using a technique called cross-validation.
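Written as an objective function (a standard formulation added here for reference, not taken from the lab sheet), ridge regression chooses the coefficients B to
minimise ||Y - XB||^2 + λ * Σ_j B_j^2
where the first term measures how well the model fits the data and the second term penalises large coefficients, with λ controlling the trade-off between the two.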
Lasso Regression
Least absolute shrinkage and selection operator, abbreviated as LASSO or lasso, is an LR technique that also performs regularisation on the variables in consideration. Its statistical analysis is very similar to that of ridge regression, except that it differs in the regularisation term: the penalty is the sum of the absolute values of the regression coefficients rather than their squares (hence the "shrinkage and selection" in the name). Because of this, the lasso can set some coefficients exactly to zero, removing those variables from the model entirely.
This method was proposed by Professor Robert Tibshirani from the University of Toronto, Canada. He said, "The Lasso minimises the residual sum of squares subject to the sum of the absolute value of the coefficients being less than a constant. Because of the nature of this constraint, it tends to produce some coefficients that are exactly 0 and hence gives interpretable models."
In his journal article titled Regression Shrinkage and Selection via the Lasso, Tibshirani gives an account of this technique with respect to various other statistical models such as subset selection and ridge regression. He goes on to say that the lasso can even be extended to generalised regression models and tree-based models, and that it also lends itself to further statistical estimation.
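For comparison (again a standard formulation added for reference), the lasso chooses B to
minimise ||Y - XB||^2 + λ * Σ_j |B_j|
and it is this absolute-value (L1) penalty that can push individual coefficients exactly to zero, which is what gives the lasso its variable-selection behaviour.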
Ridge Regression
In [1]: from sklearn.datasets import load_boston
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import numpy as np
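The fitting cell for this exercise is not preserved in the export. A minimal sketch of how the imported pieces would typically be combined, assuming the Boston housing data that load_boston provides (load_boston is deprecated in newer scikit-learn releases):
x, y = load_boston(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)
# Pick alpha (the lambda of the write-up above) by cross-validation over a small grid
alphas = np.logspace(-3, 3, 13)
ridge_cv = RidgeCV(alphas=alphas).fit(x_train, y_train)
print("Selected alpha:", ridge_cv.alpha_)
# Refit a plain Ridge model with the selected alpha and report the test-set error
ridge = Ridge(alpha=ridge_cv.alpha_).fit(x_train, y_train)
print("Test MSE:", mean_squared_error(y_test, ridge.predict(x_test)))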
Lasso Regression
In [1]: from sklearn.datasets import load_boston
from sklearn.linear_model import Lasso, LassoCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import numpy as np
import matplotlib.pyplot as plt
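The data-loading cell (In [2]) is not shown in this export. A minimal sketch of what it presumably contained, assuming the same Boston housing data as the ridge example (load_boston is deprecated in newer scikit-learn releases):
x, y = load_boston(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)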
In [3]: model=Lasso().fit(x, y)
print(model)
Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
normalize=False, positive=False, precompute=False, random_state=None,
selection='cyclic', tol=0.0001, warm_start=False)
Out[3]: Lasso()
Exercise 4
1. Perform scaling on the dataset using the MinMaxScaler class/library.
2. Perform normalization of the data by using the Normalizer class/library:
A. L1 normalization
B. L2 normalization
3. Perform binarization on the dataset using the Binarizer class/library (the binarize() helper function is used below).
4. Perform standardization on the data using the StandardScaler class/library.
In [1]:
# Scaling the data using MinMaxScaler
from numpy import asarray
from sklearn.preprocessing import MinMaxScaler
data = asarray([[100, 0.001],
[8, 0.05],
[50, 0.005],
[88, 0.07],
[4, 0.1]])
print(data)
scaler = MinMaxScaler()
scaled = scaler.fit_transform(data)
print(scaled)
[[1.0e+02 1.0e-03]
[8.0e+00 5.0e-02]
[5.0e+01 5.0e-03]
[8.8e+01 7.0e-02]
[4.0e+00 1.0e-01]]
[[1. 0. ]
[0.04166667 0.49494949]
[0.47916667 0.04040404]
[0.875 0.6969697 ]
[0. 1. ]]
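Min-max scaling maps each value to (x - min) / (max - min) per column; for example, 88 in the first column becomes (88 - 4) / (100 - 4) = 0.875, matching the output above.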
In [4]:
# L1 normalization
from sklearn.preprocessing import Normalizer
data = [[4, 1, 2, 2],
[1, 3, 9, 3],
[5, 7, 5, 1]]
transformer = Normalizer(norm='l1').fit(data)
l1_normalized = transformer.transform(data)
print(l1_normalized)
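With L1 normalization each row is divided by the sum of its absolute values; for example, [4, 1, 2, 2] sums to 9, so it becomes approximately [0.444, 0.111, 0.222, 0.222].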
In [5]:
# L2 normalization
from sklearn.preprocessing import Normalizer
data = [[4, 1, 2, 2],
[1, 3, 9, 3],
[5, 7, 5, 1]]
transformer = Normalizer(norm='l2').fit(data)
l2_normalized = transformer.transform(data)
print(l2_normalized)
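# For reference: L2 normalization divides each row by its Euclidean norm,
# e.g. [4, 1, 2, 2] has norm sqrt(16 + 1 + 4 + 4) = 5, giving [0.8, 0.2, 0.4, 0.4].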
# Binarization
from sklearn.preprocessing import binarize
data = [[ 1., -1., 2.],
[ 2., 0., 0.],
[ 0., 1., -1.]]
binarized_data = binarize(data)
print(binarized_data)
[[1. 0. 1.]
[1. 0. 0.]
[0. 1. 0.]]
In [3]:
# Standardizing the data using StandardScaler
from numpy import asarray
from sklearn.preprocessing import StandardScaler
data = asarray([[100, 0.001],
[8, 0.05],
[50, 0.005],
[88, 0.07],
[4, 0.1]])
print(data)
# define standard scaler
scaler = StandardScaler()
# transform data
scaled = scaler.fit_transform(data)
print(scaled)
[[1.0e+02 1.0e-03]
[8.0e+00 5.0e-02]
[5.0e+01 5.0e-03]
[8.8e+01 7.0e-02]
[4.0e+00 1.0e-01]]
[[ 1.26398112 -1.16389967]
[-1.06174414 0.12639634]
[ 0. -1.05856939]
[ 0.96062565 0.65304778]
[-1.16286263 1.44302493]]
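Standardization maps each value to (x - mean) / std per column; for the first column the mean is 50 and the (population) standard deviation is about 39.56, so 100 becomes (100 - 50) / 39.56 ≈ 1.264, matching the first entry of the output.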
Exercise 5
1. Select features from a dataset using Univariate Selection.
2. Select features from a dataset using Recursive Feature Elimination.
3. Select features from a dataset using Principal Component Analysis.
4. Select features from a dataset using Feature Importance.
In [1]:
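The code in this cell is not preserved in the export. A minimal univariate-selection sketch, assuming the Pima Indians Diabetes CSV used in the later cells and SelectKBest with the chi-squared score (names and parameters here are illustrative, not recovered from the original):
from pandas import read_csv
from sklearn.feature_selection import SelectKBest, chi2, RFE
from sklearn.linear_model import LogisticRegression
filename = "/Users/mohitchoudhary/Downloads/pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]
# Univariate selection: keep the 4 attributes with the highest chi-squared scores
selector = SelectKBest(score_func=chi2, k=4)
fit = selector.fit(X, Y)
print(fit.scores_)
print(fit.transform(X)[0:5, :])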
In [3]:
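The code for this cell is likewise missing; a sketch consistent with the output and warnings shown below (recursive feature elimination wrapped around a default logistic regression, selecting 3 features):
# RFE ranks features by recursively removing the weakest one
model = LogisticRegression()
rfe = RFE(model, 3)   # passing 3 positionally triggers the FutureWarning below
fit = rfe.fit(X, Y)
print("Num Features: %d" % fit.n_features_)
print("Selected Features: %s" % fit.support_)
print("Feature Ranking: %s" % fit.ranking_)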
Num Features: 3
Selected Features: [ True False False False False True True False]
Feature Ranking: [1 2 4 5 6 1 1 3]
/Users/mohitchoudhary/anaconda3/lib/python3.8/site-packages/sklearn/utils/validation.py:67: FutureWarning: Pass n_features_to_select=3 as keyword args. From version 0.25 passing these as positional arguments will result in an error
/Users/mohitchoudhary/anaconda3/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1): STOP: TOTAL NO. of ITERATIONS REACHED LIMIT. See https://scikit-learn.org/stable/modules/preprocessing.html. Please also refer to the documentation for alternative solver options: https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
In [4]:
# Feature Extraction using PCA
import numpy
from pandas import read_csv
from sklearn.decomposition import PCA
filename = "/Users/mohitchoudhary/Downloads/pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
pca = PCA(n_components=3)
fit = pca.fit(X)
print("Explained Variance: %s" % fit.explained_variance_ratio_)
print(fit.components_)
In [5]:
# Feature Extraction using Feature Importance
from pandas import read_csv
from sklearn.ensemble import ExtraTreesClassifier
filename = "/Users/mohitchoudhary/Downloads/pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
model = ExtraTreesClassifier(n_estimators=10)
model.fit(X, Y)
print(model.feature_importances_)
Exercise 6
1. Using the K-means clustering algorithm, perform clustering on any two datasets.
2. Using the mean shift clustering algorithm, perform clustering on any two datasets.
3. Using a Gaussian mixture model, perform clustering on any two datasets.
4. Provide a comparison between the various clustering algorithms on different datasets.
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans, MeanShift
from sklearn.mixture import GaussianMixture
In [2]:
# K-Means on first dataset
df = pd.read_csv('/Users/mohitchoudhary/Downloads/Iris.csv')
x = df.iloc[:, [0,1,2,3]].values
kmeans = KMeans(n_clusters=3)
y = kmeans.fit_predict(x)
plt.scatter(x[:, 0], x[:, 1], c=y, cmap='rainbow')
Out[2]:
<matplotlib.collections.PathCollection at 0x7f832c67e250>
In [3]:
# K-Means on second dataset
df = pd.read_csv('/Users/mohitchoudhary/Downloads/Mall_Customers.csv')
df = df.drop(['Genre'], axis = 1)
x = df.iloc[:, [0,1,2]].values
kmeans = KMeans(n_clusters=3)
y = kmeans.fit_predict(x)
plt.scatter(x[:, 0], x[:, 1], c=y, cmap='rainbow')
Out[3]:
<matplotlib.collections.PathCollection at 0x7f832c47df70>
In [4]:
# Mean Shift on first dataset
df = pd.read_csv('/Users/mohitchoudhary/Downloads/Iris.csv')
x = df.iloc[:, [0,1,2,3]].values
meanshift = MeanShift()
meanshift.fit(x)
y = meanshift.predict(x)
plt.scatter(x[:, 0], x[:, 1], c=y, cmap='rainbow')
Out[4]:
<matplotlib.collections.PathCollection at 0x7f832c865ee0>
In [5]:
# Mean Shift on second dataset
df = pd.read_csv('/Users/mohitchoudhary/Downloads/Mall_Customers.csv')
df = df.drop(['Genre'], axis = 1)
x = df.iloc[:, [0,1,2]].values
meanshift = MeanShift()
meanshift.fit(x)
y = meanshift.predict(x)
plt.scatter(x[:, 0], x[:, 1], c=y, cmap='rainbow')
Out[5]:
<matplotlib.collections.PathCollection at 0x7f832c6e2940>
In [6]:
# Gaussian Mixture on first dataset
df = pd.read_csv('/Users/mohitchoudhary/Downloads/Iris.csv')
x = df.iloc[:, [0,1,2,3]].values
gmm = GaussianMixture(n_components=3)
gmm.fit(x)
y = gmm.predict(x)
plt.scatter(x[:, 0], x[:, 1], c=y, cmap='rainbow')
Out[6]:
<matplotlib.collections.PathCollection at 0x7f83271526a0>
In [7]:
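The contents of this cell are not preserved in the export; following the pattern of the earlier cells, it presumably applied the Gaussian mixture model to the second dataset, roughly:
# Gaussian Mixture on second dataset (sketch, not the original cell)
df = pd.read_csv('/Users/mohitchoudhary/Downloads/Mall_Customers.csv')
df = df.drop(['Genre'], axis=1)
x = df.iloc[:, [0,1,2]].values
gmm = GaussianMixture(n_components=3)
gmm.fit(x)
y = gmm.predict(x)
plt.scatter(x[:, 0], x[:, 1], c=y, cmap='rainbow')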
Out[7]:
<matplotlib.collections.PathCollection at 0x7f832c0884c0>
K-Means
The K-means algorithm is an iterative algorithm that tries to partition the dataset into K pre-defined, distinct, non-overlapping subgroups (clusters) where each data point belongs to only one group. It tries to make the intra-cluster data points as similar as possible while also keeping the clusters as different (far apart) as possible. It assigns data points to a cluster such that the sum of the squared distances between the data points and the cluster's centroid (the arithmetic mean of all the data points belonging to that cluster) is at a minimum. The less variation we have within clusters, the more homogeneous (similar) the data points are within the same cluster. The K-means algorithm works as follows:
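1. Choose K and initialise the cluster centroids (for example, by picking K random data points).
2. Assign each data point to the cluster whose centroid is closest.
3. Recompute each centroid as the mean of the points assigned to it.
4. Repeat steps 2 and 3 until the assignments (or centroids) no longer change.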
Mean Shift
Mean Shift is a centroid-based, mode-seeking clustering algorithm. In contrast to supervised machine learning algorithms, clustering attempts to group data without first having been trained on labeled data. Clustering is used in a wide variety of applications such as search engines, academic rankings and medicine. As opposed to K-Means, when using Mean Shift you don't need to know the number of categories (clusters) beforehand. The downside to Mean Shift is that it is computationally expensive, on the order of O(n²).
How it works
1. Define a window (bandwidth of the kernel) and place the window on a data point.
2. Calculate the mean for all the points in the window.
3. Move the center of the window to the location of the mean.
4. Repeat steps 2 and 3 until there is convergence.
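In the cells above the bandwidth (the window in step 1) was left at its scikit-learn default; a small sketch of how it could be set explicitly with the estimate_bandwidth helper (shown as an illustration, not part of the original lab run):
import pandas as pd
from sklearn.cluster import MeanShift, estimate_bandwidth
df = pd.read_csv('/Users/mohitchoudhary/Downloads/Iris.csv')
x = df.iloc[:, [0,1,2,3]].values
# Estimate a bandwidth from the data; the quantile controls how wide the window is
bandwidth = estimate_bandwidth(x, quantile=0.2)
labels = MeanShift(bandwidth=bandwidth).fit_predict(x)
print("Estimated bandwidth:", bandwidth)
print("Number of clusters found:", len(set(labels)))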