ML Lab File
INDEX
S.No. Exercise
1 ML_Exercise_1
2 ML_Exercise_2
3 ML_Exercise_3
4 ML_Exercise_4
5 ML_Exercise_5
6 ML_Exercise_6
Exercise 1
1. Using the head() function, print the raw values of the data, i.e., the top n rows.
2. Using the tail() function, print the last n rows of the data.
3. Check the dimensionality of the data by using the shape attribute.
4. Get each attribute's data type by using the dtypes property.
5. Find the statistical summary of the data with the help of the describe() method.
In [4]:
import pandas as pd
In [5]:
data = pd.read_csv('/Users/mohitchoudhary/Desktop/train.csv')
In [6]:
data.head(5)
Out[6]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
[remaining rows of the head(5) output not preserved in this export]
In [7]:
data.tail(5)
Out[7]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.00 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.00 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.75 NaN Q
In [8]:
data.shape
Out[8]:
(891, 12)
In [9]:
data.dtypes
Out[9]:
PassengerId int64
Survived int64
Pclass int64
Name object
Sex object
Age float64
SibSp int64
Parch int64
Ticket object
Fare float64
Cabin object
Embarked object
dtype: object
In [10]:
data.describe()
Out[10]:
[statistical summary table: count, mean, std, min, 25%, 50%, 75% and max for each numeric column]
Exercise 2
1. Plot histograms for the dataset using the hist() function.
2. Plot density plots for the dataset to understand the attribute distributions.
3. Plot box and whisker plots for the dataset to understand the attribute distributions.
4. Plot multivariate plots (correlation matrix plot and scatter matrix plot) for the dataset.
In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
In [2]:
data = pd.read_csv('/Users/mohitchoudhary/Desktop/train.csv')
In [3]:
data.head()
Out[3]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
[remaining rows of the head() output not preserved in this export]
In [4]:
data.plot(kind = 'hist', subplots = True, layout = (3,3))
plt.show()
/Users/mohitchoudhary/anaconda3/lib/python3.8/site-packages/pandas/plotting/_matplotlib/tools.py:307, 313: MatplotlibDeprecationWarning: the rowNum and colNum attributes were deprecated in Matplotlib 3.2 and will be removed two minor releases later; use ax.get_subplotspec().rowspan.start / ax.get_subplotspec().colspan.start instead.
In [5]:
data.plot(kind = 'density', subplots = True, layout = (3,3))
plt.show()
(The same MatplotlibDeprecationWarning messages about rowNum/colNum are emitted again for this plot.)
In [6]:
data.boxplot(figsize = (10,10))
plt.show()
In [7]:
corr = data.corr()
plt.figure(figsize=(12,8))
sns.heatmap(corr, cmap="Greens",annot=True)
Out[7]:
<AxesSubplot:>
In [8]:
sns.pairplot(data=data)
plt.show()
Exercise 3
Ridge Regression
In a multiple linear regression (LR), there are many variables at play. This sometimes poses the problem of choosing the wrong variables for the model, which gives undesirable output as a result. Ridge regression is used to overcome this. It is a regularisation technique in which an extra tuning parameter is added and optimised to offset the effect of the many variables in the LR (in the statistical context, this effect is referred to as "noise").
Ridge regression is essentially an instance of LR with regularisation. Mathematically, the model is given by
Y = XB + e
where Y is the dependent variable (label), X holds the independent variables (features), B represents all the regression coefficients and e represents the residuals. Before fitting, the variables are standardised by subtracting their respective means and dividing by their standard deviations.
The tuning parameter, denoted λ, is then included in the ridge regression model as part of the regularisation. The higher the value of λ, the more strongly the coefficients are shrunk towards zero; the lower the value of λ, the closer the solution is to the ordinary least-squares fit. In simpler words, this parameter decides how heavily the coefficients are penalised. λ is found using a technique called cross-validation.
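Written as an objective function (a standard formulation added here for reference, not taken from the lab sheet), ridge regression chooses the coefficients B to
minimise ||Y - XB||^2 + λ * Σ_j B_j^2
where the first term measures how well the model fits the data and the second term penalises large coefficients, with λ controlling the trade-off between the two.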
Lasso Regression
Least absolute shrinkage and selection operator, abbreviated as LASSO or lasso, is an LR technique that also performs regularisation on the variables in consideration. Its statistical analysis is very similar to that of ridge regression, except that it differs in the regularisation term: the penalty is the sum of the absolute values of the regression coefficients rather than their squares (hence the "shrinkage and selection" in the name). Because of this, the lasso can set some coefficients exactly to zero, removing those variables from the model entirely.
This method was proposed by Professor Robert Tibshirani from the University of Toronto, Canada. He said, "The Lasso minimises the residual sum of squares subject to the sum of the absolute value of the coefficients being less than a constant. Because of the nature of this constraint, it tends to produce some coefficients that are exactly 0 and hence gives interpretable models."
In his journal article titled Regression Shrinkage and Selection via the Lasso, Tibshirani gives an account of this technique with respect to various other statistical models such as subset selection and ridge regression. He goes on to say that the lasso can even be extended to generalised regression models and tree-based models, and that it also lends itself to further statistical estimation.
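For comparison (again a standard formulation added for reference), the lasso chooses B to
minimise ||Y - XB||^2 + λ * Σ_j |B_j|
and it is this absolute-value (L1) penalty that can push individual coefficients exactly to zero, which is what gives the lasso its variable-selection behaviour.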
Ridge Regression
In [1]: from sklearn.datasets import load_boston
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import numpy as np
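The fitting cell for this exercise is not preserved in the export. A minimal sketch of how the imported pieces would typically be combined, assuming the Boston housing data that load_boston provides (load_boston is deprecated in newer scikit-learn releases):
x, y = load_boston(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)
# Pick alpha (the lambda of the write-up above) by cross-validation over a small grid
alphas = np.logspace(-3, 3, 13)
ridge_cv = RidgeCV(alphas=alphas).fit(x_train, y_train)
print("Selected alpha:", ridge_cv.alpha_)
# Refit a plain Ridge model with the selected alpha and report the test-set error
ridge = Ridge(alpha=ridge_cv.alpha_).fit(x_train, y_train)
print("Test MSE:", mean_squared_error(y_test, ridge.predict(x_test)))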
Lasso Regression
In [1]: from sklearn.datasets import load_boston
from sklearn.linear_model import Lasso, LassoCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import numpy as np
import matplotlib.pyplot as plt
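The data-loading cell (In [2]) is not shown in this export. A minimal sketch of what it presumably contained, assuming the same Boston housing data as the ridge example (load_boston is deprecated in newer scikit-learn releases):
x, y = load_boston(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)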
In [3]: model=Lasso().fit(x, y)
print(model)
Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
normalize=False, positive=False, precompute=False, random_state=None,
selection='cyclic', tol=0.0001, warm_start=False)
Out[3]: Lasso()
Exercise 4
1. Perform scaling on the dataset using the MinMaxScaler class/library.
2. Perform normalization of the data by using the Normalizer class/library:
A. L1 normalization
B. L2 normalization
3. Perform binarization on the dataset using the Binarizer class/library (the binarize() helper function is used below).
4. Perform standardization on the data using the StandardScaler class/library.
In [1]:
# Scaling the data using MinMaxScaler
from numpy import asarray
from sklearn.preprocessing import MinMaxScaler
data = asarray([[100, 0.001],
[8, 0.05],
[50, 0.005],
[88, 0.07],
[4, 0.1]])
print(data)
scaler = MinMaxScaler()
scaled = scaler.fit_transform(data)
print(scaled)
[[1.0e+02 1.0e-03]
[8.0e+00 5.0e-02]
[5.0e+01 5.0e-03]
[8.8e+01 7.0e-02]
[4.0e+00 1.0e-01]]
[[1. 0. ]
[0.04166667 0.49494949]
[0.47916667 0.04040404]
[0.875 0.6969697 ]
[0. 1. ]]
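Min-max scaling maps each value to (x - min) / (max - min) per column; for example, 88 in the first column becomes (88 - 4) / (100 - 4) = 0.875, matching the output above.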
In [4]:
# L1 normalization
from sklearn.preprocessing import Normalizer
data = [[4, 1, 2, 2],
[1, 3, 9, 3],
[5, 7, 5, 1]]
transformer = Normalizer(norm='l1').fit(data)
l1_normalized = transformer.transform(data)
print(l1_normalized)
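With L1 normalization each row is divided by the sum of its absolute values; for example, [4, 1, 2, 2] sums to 9, so it becomes approximately [0.444, 0.111, 0.222, 0.222].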
In [5]:
# L2 normalization
from sklearn.preprocessing import Normalizer
data = [[4, 1, 2, 2],
[1, 3, 9, 3],
[5, 7, 5, 1]]
transformer = Normalizer(norm='l2').fit(data)
l2_normalized = transformer.transform(data)
print(l2_normalized)
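# For reference: L2 normalization divides each row by its Euclidean norm,
# e.g. [4, 1, 2, 2] has norm sqrt(16 + 1 + 4 + 4) = 5, giving [0.8, 0.2, 0.4, 0.4].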
# Binarization
from sklearn.preprocessing import binarize
data = [[ 1., -1., 2.],
[ 2., 0., 0.],
[ 0., 1., -1.]]
binarized_data = binarize(data)
print(binarized_data)
[[1. 0. 1.]
[1. 0. 0.]
[0. 1. 0.]]
In [3]:
# Standardizing the data using StandardScaler
from numpy import asarray
from sklearn.preprocessing import StandardScaler
data = asarray([[100, 0.001],
[8, 0.05],
[50, 0.005],
[88, 0.07],
[4, 0.1]])
print(data)
# define standard scaler
scaler = StandardScaler()
# transform data
scaled = scaler.fit_transform(data)
print(scaled)
[[1.0e+02 1.0e-03]
[8.0e+00 5.0e-02]
[5.0e+01 5.0e-03]
[8.8e+01 7.0e-02]
[4.0e+00 1.0e-01]]
[[ 1.26398112 -1.16389967]
[-1.06174414 0.12639634]
[ 0. -1.05856939]
[ 0.96062565 0.65304778]
[-1.16286263 1.44302493]]
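Standardization maps each value to (x - mean) / std per column; for the first column the mean is 50 and the (population) standard deviation is about 39.56, so 100 becomes (100 - 50) / 39.56 ≈ 1.264, matching the first entry of the output.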
Exercise 5
1. Select features from a dataset using Univariate Selection.
2. Select features from a dataset using Recursive Feature Elimination.
3. Select features from a dataset using Principal Component Analysis.
4. Select features from a dataset using Feature Importance.
In [1]:
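The code in this cell is not preserved in the export. A minimal univariate-selection sketch, assuming the Pima Indians Diabetes CSV used in the later cells and SelectKBest with the chi-squared score (names and parameters here are illustrative, not recovered from the original):
from pandas import read_csv
from sklearn.feature_selection import SelectKBest, chi2, RFE
from sklearn.linear_model import LogisticRegression
filename = "/Users/mohitchoudhary/Downloads/pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]
# Univariate selection: keep the 4 attributes with the highest chi-squared scores
selector = SelectKBest(score_func=chi2, k=4)
fit = selector.fit(X, Y)
print(fit.scores_)
print(fit.transform(X)[0:5, :])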
In [3]:
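The code for this cell is likewise missing; a sketch consistent with the output and warnings shown below (recursive feature elimination wrapped around a default logistic regression, selecting 3 features):
# RFE ranks features by recursively removing the weakest one
model = LogisticRegression()
rfe = RFE(model, 3)   # passing 3 positionally triggers the FutureWarning below
fit = rfe.fit(X, Y)
print("Num Features: %d" % fit.n_features_)
print("Selected Features: %s" % fit.support_)
print("Feature Ranking: %s" % fit.ranking_)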
Num Features: 3
Selected Features: [ True False False False False True True False]
Feature Ranking: [1 2 4 5 6 1 1 3]
/Users/mohitchoudhary/anaconda3/lib/python3.8/site-packages/sklearn/utils/validation.py:67: FutureWarning: Pass n_features_to_select=3 as keyword args. From version 0.25 passing these as positional arguments will result in an error
/Users/mohitchoudhary/anaconda3/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1): STOP: TOTAL NO. of ITERATIONS REACHED LIMIT. See https://scikit-learn.org/stable/modules/preprocessing.html. Please also refer to the documentation for alternative solver options: https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
In [4]:
# Feature Extraction using PCA
import numpy
from pandas import read_csv
from sklearn.decomposition import PCA
filename = "/Users/mohitchoudhary/Downloads/pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
pca = PCA(n_components=3)
fit = pca.fit(X)
print("Explained Variance: %s" % fit.explained_variance_ratio_)
print(fit.components_)
In [5]:
# Feature Extraction using Feature Importance
from pandas import read_csv
from sklearn.ensemble import ExtraTreesClassifier
filename = "/Users/mohitchoudhary/Downloads/pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
model = ExtraTreesClassifier(n_estimators=10)
model.fit(X, Y)
print(model.feature_importances_)
Exercise 6
1. Using the K-means clustering algorithm, perform clustering on any two datasets.
2. Using the mean shift clustering algorithm, perform clustering on any two datasets.
3. Using a Gaussian mixture model, perform clustering on any two datasets.
4. Provide a comparison between the various clustering algorithms on different datasets.
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans, MeanShift
from sklearn.mixture import GaussianMixture
In [2]:
# K-Means on first dataset
df = pd.read_csv('/Users/mohitchoudhary/Downloads/Iris.csv')
x = df.iloc[:, [0,1,2,3]].values
kmeans = KMeans(n_clusters=3)
y = kmeans.fit_predict(x)
plt.scatter(x[:, 0], x[:, 1], c=y, cmap='rainbow')
Out[2]:
<matplotlib.collections.PathCollection at 0x7f832c67e250>
In [3]:
# K-Means on second dataset
df = pd.read_csv('/Users/mohitchoudhary/Downloads/Mall_Customers.csv')
df = df.drop(['Genre'], axis = 1)
x = df.iloc[:, [0,1,2]].values
kmeans = KMeans(n_clusters=3)
y = kmeans.fit_predict(x)
plt.scatter(x[:, 0], x[:, 1], c=y, cmap='rainbow')
Out[3]:
<matplotlib.collections.PathCollection at 0x7f832c47df70>
In [4]:
# Mean Shift on first dataset
df = pd.read_csv('/Users/mohitchoudhary/Downloads/Iris.csv')
x = df.iloc[:, [0,1,2,3]].values
meanshift = MeanShift()
meanshift.fit(x)
y = meanshift.predict(x)
plt.scatter(x[:, 0], x[:, 1], c=y, cmap='rainbow')
Out[4]:
<matplotlib.collections.PathCollection at 0x7f832c865ee0>
In [5]:
# Mean Shift on second dataset
df = pd.read_csv('/Users/mohitchoudhary/Downloads/Mall_Customers.csv')
df = df.drop(['Genre'], axis = 1)
x = df.iloc[:, [0,1,2]].values
meanshift = MeanShift()
meanshift.fit(x)
y = meanshift.predict(x)
plt.scatter(x[:, 0], x[:, 1], c=y, cmap='rainbow')
Out[5]:
<matplotlib.collections.PathCollection at 0x7f832c6e2940>
In [6]:
# Gaussian Mixture on first dataset
df = pd.read_csv('/Users/mohitchoudhary/Downloads/Iris.csv')
x = df.iloc[:, [0,1,2,3]].values
gmm = GaussianMixture(n_components=3)
gmm.fit(x)
y = gmm.predict(x)
plt.scatter(x[:, 0], x[:, 1], c=y, cmap='rainbow')
Out[6]:
<matplotlib.collections.PathCollection at 0x7f83271526a0>
In [7]:
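The contents of this cell are not preserved in the export; following the pattern of the earlier cells, it presumably applied the Gaussian mixture model to the second dataset, roughly:
# Gaussian Mixture on second dataset (sketch, not the original cell)
df = pd.read_csv('/Users/mohitchoudhary/Downloads/Mall_Customers.csv')
df = df.drop(['Genre'], axis=1)
x = df.iloc[:, [0,1,2]].values
gmm = GaussianMixture(n_components=3)
gmm.fit(x)
y = gmm.predict(x)
plt.scatter(x[:, 0], x[:, 1], c=y, cmap='rainbow')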
Out[7]:
<matplotlib.collections.PathCollection at 0x7f832c0884c0>
K-Means
The K-means algorithm is an iterative algorithm that tries to partition the dataset into K pre-defined, distinct, non-overlapping subgroups (clusters) where each data point belongs to only one group. It tries to make the intra-cluster data points as similar as possible while also keeping the clusters as different (far apart) as possible. It assigns data points to a cluster such that the sum of the squared distances between the data points and the cluster's centroid (the arithmetic mean of all the data points belonging to that cluster) is at a minimum. The less variation we have within clusters, the more homogeneous (similar) the data points are within the same cluster. The K-means algorithm works as follows:
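1. Choose K and initialise the cluster centroids (for example, by picking K random data points).
2. Assign each data point to the cluster whose centroid is closest.
3. Recompute each centroid as the mean of the points assigned to it.
4. Repeat steps 2 and 3 until the assignments (or centroids) no longer change.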
Mean Shift
Mean Shift is a centroid-based, mode-seeking clustering algorithm. In contrast to supervised machine learning algorithms, clustering attempts to group data without first having been trained on labeled data. Clustering is used in a wide variety of applications such as search engines, academic rankings and medicine. As opposed to K-Means, when using Mean Shift you don't need to know the number of categories (clusters) beforehand. The downside to Mean Shift is that it is computationally expensive, on the order of O(n²).
How it works
1. Define a window (bandwidth of the kernel) and place the window on a data point.
2. Calculate the mean for all the points in the window.
3. Move the center of the window to the location of the mean.
4. Repeat steps 2 and 3 until there is convergence.
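In the cells above the bandwidth (the window in step 1) was left at its scikit-learn default; a small sketch of how it could be set explicitly with the estimate_bandwidth helper (shown as an illustration, not part of the original lab run):
import pandas as pd
from sklearn.cluster import MeanShift, estimate_bandwidth
df = pd.read_csv('/Users/mohitchoudhary/Downloads/Iris.csv')
x = df.iloc[:, [0,1,2,3]].values
# Estimate a bandwidth from the data; the quantile controls how wide the window is
bandwidth = estimate_bandwidth(x, quantile=0.2)
labels = MeanShift(bandwidth=bandwidth).fit_predict(x)
print("Estimated bandwidth:", bandwidth)
print("Number of clusters found:", len(set(labels)))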