MLC Practical

EXPERIMENT NO: 1

Aim: Introduction to Python - installation, operators, decision making, loops

Theory: Python is an interpreted, object-oriented, high-level programming


language. It was created by Guido van Rossum and first released in 1991. Python's
simple, easy-to-learn syntax emphasises readability and therefore reduces the
cost of program maintenance.
Python is widely used in various fields, including web development, data
analysis, artificial intelligence, scientific computing, and more.
Features:
● Interpreted
- There are no separate compilation and execution steps as in C and C++.
- The program is run directly from the source code.
- Internally, Python converts the source code into an intermediate form
called bytecode, which is then translated into the native language of the
specific computer to run it.
- There is no need to worry about linking and loading with libraries, etc.
● Platform Independent
- Python programs can be developed and executed on multiple operating
system platforms.
- Python can be used on Linux, Windows, Macintosh, Solaris and many
more.
● High-level Language
- In Python, there is no need to take care of low-level details such as
managing the memory used by the program.
● Simple
- Closer to the English language; easy to learn.
- More emphasis on the solution to the problem rather than the syntax.

Installation of Python in Windows 11:


Step 1: Go to the official Python website:
https://www.python.org/downloads/
Click on the "Download Python" button. The website will automatically
suggest the best version for your system.

Step 2: Run the Installer:


Once the download is complete, open the installer file.
In the installer window, make sure to check the box that says "Add Python to
PATH". This will allow you to run Python from the command line.
Click "Install Now".

Step 3: Verify the Installation:

Open Command Prompt.
Type python --version and press Enter. You should see the version of Python
that you installed.

Basic Syntax and Concepts of Python:

Basic Syntax

Operators
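The original pages here reproduced screenshots of Python's basic syntax and operator tables, which are not included in this text. As a small stand-in (a sketch written for this experiment's topics, not taken from the original pages), the snippet below touches the three items in the aim: operators, decision making, and loops.

# Arithmetic, comparison and logical operators
a, b = 10, 3
print(a + b, a - b, a * b, a / b, a // b, a % b, a ** b)
print(a > b, a == b, a != b)
print(a > 5 and b < 5, not a > 5)

# Decision making with if / elif / else
marks = 72
if marks >= 75:
    print("Distinction")
elif marks >= 40:
    print("Pass")
else:
    print("Fail")

# Loops: a for loop over a range and a while loop with a condition
for i in range(1, 6):
    print(i, end=" ")
print()
n = 5
while n > 0:
    print(n, end=" ")
    n -= 1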

EXPERIMENT NO: 2

Aim: To study Python libraries used for machine learning

PANDAS
Pandas is a popular Python library for data analysis. It is not directly related
to Machine Learning, but since the dataset must be prepared before training,
Pandas comes in handy: it was developed specifically for data extraction and
preparation. It provides high-level data structures and a wide variety of tools
for data analysis, including many inbuilt methods for grouping, combining and
filtering data.
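As a brief illustration (a sketch with made-up values, not part of the original experiment), a few lines of Pandas cover building, filtering and grouping a DataFrame:

import pandas as pd

# Hypothetical in-memory dataset; the column names are invented for illustration
df = pd.DataFrame({
    "city": ["Pune", "Pune", "Mumbai", "Mumbai"],
    "price": [50, 65, 90, 120],
})
print(df[df["price"] > 60])                  # filtering rows
print(df.groupby("city")["price"].mean())    # grouping and aggregation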

NUMPY
NumPy is a very popular Python library for large multi-dimensional array and
matrix processing, with the help of a large collection of high-level
mathematical functions. It is very useful for fundamental scientific
computations in Machine Learning, particularly for its linear algebra,
Fourier transform, and random number capabilities. High-end libraries like
TensorFlow use NumPy internally for manipulation of tensors.
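A minimal sketch (not part of the original experiment) of the capabilities mentioned above:

import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.random.rand(2, 2)           # random number capabilities
print(a.dot(b))                    # matrix multiplication
print(np.linalg.inv(a))            # linear algebra: matrix inverse
print(np.fft.fft([1, 0, 1, 0]))    # Fourier transform of a small signal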

MATPLOTLIB
Matplotlib is a very popular Python library for data visualisation. Like Pandas,
it is not directly related to Machine Learning. It particularly comes in handy
when a programmer wants to visualise the patterns in the data. It is a 2D
plotting library used for creating 2D graphs and plots. A module named pyplot
makes plotting easy for programmers, as it provides features to control
line styles, font properties, axis formatting, etc. It provides various kinds of
graphs and plots for data visualisation, viz., histograms, error charts, bar
charts, etc.
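A minimal pyplot sketch (not part of the original experiment) showing one of the plot types mentioned above:

import matplotlib.pyplot as plt
import numpy as np

data = np.random.randn(100)        # 100 random values to plot
plt.hist(data, bins=10)            # histogram
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.title("Histogram drawn with pyplot")
plt.show()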

SCIPY
SciPy is a very popular library among Machine Learning enthusiasts as it
contains different modules for optimization, linear algebra, integration and
statistics. There is a difference between the SciPy library and the SciPy stack:
the SciPy library is one of the core packages that make up the SciPy stack. SciPy is
also very useful for image manipulation.
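A minimal sketch (not part of the original experiment) of two SciPy modules named above:

from scipy import optimize, stats

# Minimise a simple quadratic; the minimum is at x = 3
result = optimize.minimize(lambda x: (x[0] - 3) ** 2, x0=[0.0])
print(result.x)

# Basic descriptive statistics of a small sample
print(stats.describe([1, 2, 3, 4, 5]))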

SCIKIT-LEARN
Scikit-learn is one of the most popular ML libraries for classical ML
algorithms. It is built on top of two basic Python libraries, viz., NumPy and
SciPy. Scikit-learn supports most of the supervised and unsupervised
learning algorithms. Scikit-learn can also be used for data-mining and
data-analysis, which makes it a great tool for those starting out with ML.
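A minimal sketch (not part of the original experiment) of the typical scikit-learn workflow on a built-in dataset:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0)
model = LogisticRegression(max_iter=200)
model.fit(x_train, y_train)            # supervised learning: fit on the training split
print(model.score(x_test, y_test))     # accuracy on the held-out split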

EXPERIMENT NO: 3

Aim: To study detecting and filling missing data in Python

Theory:
In data analysis, missing data is a frequent issue that can hinder the accuracy
of models and insights. Handling this data properly is essential to ensure data
quality and reliable results. In Python, libraries like Pandas provide powerful
tools to detect and manage missing values.

1. Identifying Missing Data:


Missing data is usually represented as NaN (Not a Number). Before taking
action, it is important to identify where these missing values occur in the
dataset.

2. Filling Missing Data:


There are several techniques to fill missing data:
- Constant Value Filling: Missing values can be replaced with a specific
constant, like zero or another relevant placeholder.
- Forward and Backward Filling: Missing values can be filled by propagating
the previous or next valid data point.
- Imputation: More advanced methods involve replacing missing data with
statistical measures like the mean, median, or mode of the column.

3. Dropping Missing Data:


Sometimes, it may be more effective to remove rows or columns with
missing data altogether, especially when the proportion of missing data is
significant.

By applying these techniques, analysts can ensure the dataset is complete and
suitable for further analysis, improving the overall accuracy and performance
of models.
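The program listings below demonstrate detecting missing values and constant-value filling with fillna(0). As a short complementary sketch (not part of the original listing, using a small made-up frame), forward/backward filling, mean imputation and dropping look like this:

import numpy as np
import pandas as pd

s = pd.DataFrame({"one": [1.0, np.nan, 3.0, np.nan, 5.0]})
print(s.ffill())                    # forward filling
print(s.bfill())                    # backward filling
print(s.fillna(s["one"].mean()))    # imputation with the column mean
print(s.dropna())                   # dropping rows with missing data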

Program:

[1] import pandas as pd


import numpy as np
df = pd.DataFrame (np.random.randn(5, 3), index = ['a', 'c', 'e', 'f',
'h'], columns = ['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df)

one two three


a -0.587407 0.245445 -1.157601
b NaN NaN NaN

c -0.595357 -0.062141 -0.679225
d NaN NaN NaN
e 0.910208 -1.230797 0.191110
f -0.062459 0.092898 1.320681
g NaN NaN NaN
h 1.313131 -0.963366 0.358444

[2] import pandas as pd


import numpy as np
df = pd.DataFrame (np.random.randn(5, 3), index = ['a', 'c', 'e', 'f',
'h'], columns = ['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print (df['one'].isnull())

a False
b True
c False
d True
e False
f False
g True
h False
Name: one, dtype: bool

[3] import pandas as pd


import numpy as np
df = pd.DataFrame (np.random.randn(5, 3), index = ['a', 'c', 'e', 'f',
'h'], columns = ['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print (df['one'].notnull())

a True
b False
c True
d False
e True
f True
g False
h True
Name: one, dtype: bool

[4] import pandas as pd


import numpy as np
df = pd.DataFrame (np.random.randn(5, 3), index = ['a', 'c', 'e', 'f',
'h'], columns = ['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print (df['one'].sum())

2.6321528307002513

[5] import pandas as pd
import numpy as np
df = pd.DataFrame (np.random.randn(5, 3), index = [1, 2, 3, 4, 5],
columns = ['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print (df['one'].sum())

0.0

[6] import pandas as pd


import numpy as np
df = pd.DataFrame (np.random.randn(3, 3), index = ['a', 'c', 'e'],
columns = ['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c'])
print (df)
print ("NaN replaced with '0': ")
print (df.fillna(0))

one two three


a 2.452051 -0.233571 -1.000200
b NaN NaN NaN
c -1.410490 0.543543 0.262501
NaN replaced with '0':
one two three
a 2.452051 -0.233571 -1.000200
b 0.000000 0.000000 0.000000
c -1.410490 0.543543 0.262501

[7] import pandas as pd


import numpy as np
df = pd.DataFrame (np.random.randn(5, 3), index = ['a', 'c', 'e', 'f',
'h'], columns = ['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print (df['one'].isnull())

EXPERIMENT NO: 4

Aim: To write a program using Python to implement Linear Regression


(Single Variable and Multivariable)

Theory:
Linear regression is a fundamental statistical method used to model the
relationship between a dependent variable (target) and one or more
independent variables (predictors). It is one of the most basic and widely used
forms of predictive modeling in machine learning and statistics.

[1] import pandas as pd


import numpy as np
from sklearn import linear_model
import matplotlib.pyplot as plt

[2] df = pd.read_csv('homeprice.csv')

[3] df

Area Price
0 1000 230000
1 1300 270000
2 3000 620000
3 2600 570000
4 3200 660000
5 2100 510000

[4] plt.xlabel('Area')
plt.ylabel('Price')
plt.scatter(df.Area, df.Price, color='red', marker='+')

<matplotlib.collections.PathCollection at 0xec17800>

[5] x = df.iloc[:, 0].values.reshape(-1, 1)
x

array([[1000],
[1300],
[3000],
[2600],
[3200],
[2100]], dtype=int64)

[6] y = df.iloc[:, 1].values.reshape(-1,1)


y

array([[230000],
[270000],
[620000],
[570000],
[660000],
[510000]], dtype=int64)

[7] reg = linear_model.LinearRegression()


reg.fit(x, y)

LinearRegression()

Predict price of home with area = 3300 sq.ft.

[8] reg.predict([[3300]])

array([[697208.53858785]])

[9] reg.coef_

array([[200.49261084]])

[10] reg.intercept_

array([35582.9228243])

Y = m*x + b (m: coefficient, b: intercept)

[11] 3300*200.49261084 + 35582.92282430234

697208.5385963024

Predict price of home with area = 5000 sq.ft.

[12] reg.predict([[5000]])

array([[1038045.97701149]])

[13] y_pred = reg.predict(x)

Regression Line

[14] plt.scatter(x, y)
plt.plot(x, y_pred,color='red')
plt.show()

Generate CSV file with list of home price predictions.

[15] area_df = pd.read_csv("area.csv")

[16] area_df

Area
0 1100
1 1600
2 2000
3 2200
4 2400
5 2800
6 3400
7 4000

[17] p = reg.predict(area_df)

C:\Users\admin\anaconda3\Lib\site-packages\sklearn\base.py:486: UserWarning: X
has feature names, but LinearRegression was fitted without feature names
warnings.warn

[18] p

array([[256124.79474548],
[356371.1001642 ],
[436568.14449918],
[476666.66666667],
[516765.18883415],
[596962.23316913],
[717257.79967159],
[837553.36617406]])

[19] area_df['price']=p

[20] area_df

Area price
0 1100 256124.794745
1 1600 356371.100164
2 2000 436568.144499
3 2200 476666.666667
4 2400 516765.188834
5 2800 596962.233169
6 3400 717257.799672
7 4000 837553.366174

area_df.to_csv("prediction.csv")

EXPERIMENT NO: 5

Aim: To study and perform Multiple Linear Regression model.

Theory:
In Simple Linear Regression, a single independent/predictor variable (X) is
used to model the response variable (Y). But there may be various cases in
which the response variable is affected by more than one predictor variable;
for such cases, the Multiple Linear Regression algorithm is used. Moreover,
Multiple Linear Regression is an extension of Simple Linear Regression, as it
takes more than one predictor variable to predict the response variable. We
can define it as:
Definition: Multiple Linear Regression is one of the important regression
algorithms which models the linear relationship between a single dependent
continuous variable and more than one independent variable.

Example:
Prediction of CO2 emission based on engine size and number of cylinders in a
car.

Multiple Linear Regression


In Multiple Linear Regression, the target variable (Y) is a linear combination of
multiple predictor variables x1, x2, x3, ..., xn. Since it is an enhancement of
Simple Linear Regression, the same idea is applied to the multiple linear
regression equation, which becomes:

Y = b0 + b1*x1 + b2*x2 + ... + bn*xn

Implementation of the Multiple Linear Regression model using Python: To

implement MLR using Python, we have the following problem. We have a dataset of 6
home prices. This dataset contains four pieces of information: Area, Bedrooms, Age
and Price. The goal is to create a model that can predict the price of a house and
show which factor affects the price the most. Since we need to find the Price, it is
the dependent variable, and the other three variables are independent variables.

1. Data Pre-processing Steps


2. Fitting the MLR model to the training set
3. Predicting the result of the test set
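The printed program pages are not reproduced in this text. The sketch below follows the three steps above under stated assumptions: a small in-memory dataset stands in for the 6-row home-price data described earlier (the actual values and any CSV file name used in the original are unknown).

import pandas as pd
from sklearn import linear_model

# Hypothetical data matching the description: Area, Bedrooms, Age and Price for 6 homes
df = pd.DataFrame({
    "Area":     [2600, 3000, 3200, 3600, 4000, 4100],
    "Bedrooms": [3, 4, 3, 3, 5, 6],
    "Age":      [20, 15, 18, 30, 8, 8],
    "Price":    [550000, 565000, 610000, 595000, 760000, 810000],
})

# Fit the MLR model: Price as a linear combination of Area, Bedrooms and Age
reg = linear_model.LinearRegression()
reg.fit(df[["Area", "Bedrooms", "Age"]], df.Price)
print(reg.coef_, reg.intercept_)

# Predict the price of a 3000 sq.ft., 3-bedroom, 40-year-old home
print(reg.predict([[3000, 3, 40]]))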

EXPERIMENT NO: 6

Aim: To write a program using Python to implement logistic regression.

Problem Statement: Predicting if a person would buy life insurance based on


his age using logistic regression.

Theory:

Logistic Regression

Logistic regression is a supervised machine learning algorithm used for


classification tasks where the goal is to predict the probability that an
instance belongs to a given class or not.

Logistic regression is used for binary classification, where we use the sigmoid
function, which takes the independent variables as input and produces a
probability value between 0 and 1.

● Logistic regression predicts the output of a categorical dependent


variable. Therefore, the outcome must be a categorical or discrete value.
● It can be either Yes or No, 0 or 1, True or False, etc., but instead of giving
the exact value as 0 or 1, it gives probabilistic values which lie
between 0 and 1.
● In Logistic regression, instead of fitting a regression line, we fit an “S”
shaped logistic function, which predicts two maximum values (0 or 1).

Logistic Function – Sigmoid Function

The sigmoid function is a mathematical function used to map the predicted


values to probabilities.

It maps any real value into another value within the range 0 to 1. The output
of logistic regression must lie between 0 and 1 and cannot go beyond this
limit, so it forms a curve shaped like the letter "S".

The S-form curve is called the Sigmoid function or the logistic function.
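For reference, the standard definition of the sigmoid (logistic) function is:

sigmoid(z) = 1 / (1 + e^(-z))

where z = m*x + b is the output of the linear part of the model, so the predicted probability always lies between 0 and 1.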

Types of Logistic Regression

On the basis of the categories, Logistic Regression can be classified into three
types:

Binomial: In binomial Logistic regression, there can be only two possible types
of the dependent variables, such as 0 or 1, Pass or Fail, etc.

Multinomial: In multinomial Logistic regression, there can be 3 or more
possible unordered types of the dependent variable, such as "cat", "dog", or
"sheep".

Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered


types of dependent variables, such as “low”, “Medium”, or “High”.

Program:

In [ ]: import pandas as pd

import numpy as np

from sklearn import linear_model

from google.colab import files

uploaded = files.upload()


Saving insurance_data.csv to insurance_data.csv

In [ ]: df = pd.read_csv('insurance_data.csv')

df

Out[ ]:

In [ ]: import matplotlib.pyplot as plt

plt.scatter(df.Age,df.Bought_Insurance,marker='+',color='red')

Out[ ]:
<matplotlib.collections.PathCollection at 0x7ab58c242fb0>

In [ ]: from sklearn.model_selection import train_test_split

In [ ]: x_train,x_test,y_train,y_test =
train_test_split(df[['Age']],df.Bought_Insurance,train_size=0.8)

In [ ]: x_test

Out[ ]:

In [ ]: x_train

Out[ ]:

In [ ]: from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

model.fit(x_train,y_train)

Out[ ]: LogisticRegression()
In [ ]: y_predicted = model.predict(x_test)

In [14]: model.predict_proba(x_test)

Out[14]:

array([[0.04246871, 0.95753129],

[0.19494106, 0.80505894],

[0.04246871, 0.95753129],

[0.97525075, 0.02474925],

[0.96265717, 0.03734283],

[0.13674865, 0.86325135]])

In [15]: from sklearn.metrics import confusion_matrix


from sklearn.metrics import classification_report

Find the results (scikit-learn lays out the binary confusion matrix as [[TN FP] [FN TP]])

In [16]: confusion_matrix(y_test,y_predicted)
Out[16]:

array([[1, 1],

[1, 3]])

In [17]: print("Classification Report")

print(classification_report(y_test,y_predicted))

Classification Report

precision recall f1-score support

0 0.50 0.50 0.50 2

1 0.75 0.75 0.75 4

accuracy 0.67 6

macro avg 0.62 0.62 0.62 6

weighted avg 0.67 0.67 0.67 6

EXPERIMENT NO: 7

Aim: To implement logistic regression for multi-class classification.

Theory:
Logistic regression can be extended to handle multiclass classification
problems using several approaches. Unlike binary logistic regression, which
deals with two classes, multiclass classification involves more than two
classes.
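As a brief illustration of two common approaches (a sketch, not part of the original program), scikit-learn can either train one binary model per class (one-vs-rest) or a single multinomial (softmax) model; both are shown here scored on their own training data:

from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

digits = load_digits()

# One-vs-rest: one binary logistic regression is trained per class
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
print(ovr.fit(digits.data, digits.target).score(digits.data, digits.target))

# Multinomial (softmax) logistic regression: a single joint model over all classes
softmax = LogisticRegression(max_iter=1000)
print(softmax.fit(digits.data, digits.target).score(digits.data, digits.target))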

Program:

In [ ]: from sklearn.datasets import load_digits

In [ ]: import matplotlib.pyplot as plt

In [ ]: digits=load_digits()

In [ ]: digits.data.shape

Out[ ]: (1797, 64)


In [ ]: plt.gray()

for i in range(5):

plt.matshow(digits.images[i])

<Figure size 640x480 with 0 Axes>

In [ ]: dir(digits)

Out[ ]: ['DESCR', 'data', 'feature_names', 'frame', 'images', 'target',


'target_names']

In [ ]: digits.DESCR[0]

Out[ ]: '.'
In [ ]: digits.data[0]

Out[ ]: array([ 0., 0., 5., 13., 9., 1., 0., 0., 0., 0., 13., 15., 10.,

15., 5., 0., 0., 3., 15., 2., 0., 11., 8., 0., 0., 4.,

12., 0., 0., 8., 8., 0., 0., 5., 8., 0., 0., 9., 8.,

0., 0., 4., 11., 0., 1., 12., 7., 0., 0., 2., 14., 5.,

10., 12., 0., 0., 0., 0., 6., 13., 10., 0., 0., 0.])
In [ ]: digits.target[0]

Out[ ]: 0
In [ ]: digits.target_names[0]

Out[ ]: 0
Create and train logistic regression model.
In [ ]: from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

In [ ]: from sklearn.model_selection import train_test_split

In [ ]: x_train, x_test, y_train, y_test = train_test_split(digits.data,


digits.target, test_size=0.2)

In [ ]: model.fit(x_train, y_train)

/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py:460:
ConvergenceWarning: lbfgs failed to converge (status=1):

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:

https://scikit-learn.org/stable/modules/preprocessing.html

Please also refer to the documentation for alternative solver options:

https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

n_iter_i = _check_optimize_result(

Out[ ]: LogisticRegression()
In [ ]: x_test

Out[ ]: array([[ 0., 0., 12., ..., 2., 0., 0.],

[ 0., 0., 0., ..., 0., 0., 0.],

[ 0., 0., 0., ..., 11., 1., 0.],

...,

[ 0., 0., 7., ..., 9., 1., 0.],

[ 0., 0., 4., ..., 3., 0., 0.],

[ 0., 0., 10., ..., 12., 4., 0.]])


In [ ]: model.score(x_test, y_test)

Out[ ]: 0.9638888888888889
In [ ]: model.predict(digits.data[0:10])

Out[ ]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])


In [ ]: y_pred = model.predict(x_test)

In [ ]: from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, y_pred)

Out[ ]: array([[39, 0, 0, 0, 0, 1, 0, 0, 0, 0],

[ 0, 38, 0, 0, 0, 0, 0, 0, 0, 0],

[ 0, 1, 35, 0, 0, 0, 0, 0, 0, 0],

[ 0, 0, 0, 36, 0, 1, 0, 1, 2, 0],

[ 0, 0, 0, 0, 29, 0, 0, 0, 0, 0],

[ 0, 1, 0, 0, 0, 30, 0, 0, 0, 0],

[ 0, 0, 0, 0, 0, 0, 36, 0, 0, 0],

[ 0, 1, 0, 0, 0, 0, 0, 31, 0, 1],

[ 0, 0, 0, 0, 0, 0, 0, 0, 39, 0],

[ 0, 1, 0, 0, 0, 1, 0, 0, 2, 34]])
In [ ]: from sklearn.metrics import classification_report

print("Classification Report: \n", classification_report(y_test,


y_pred))

Classification Report:

precision recall f1-score support

0 1.00 0.97 0.99 40

1 0.90 1.00 0.95 38

2 1.00 0.97 0.99 36

3 1.00 0.90 0.95 40

4 1.00 1.00 1.00 29

5 0.91 0.97 0.94 31

6 1.00 1.00 1.00 36

7 0.97 0.94 0.95 33

8 0.91 1.00 0.95 39

9 0.97 0.89 0.93 38

accuracy 0.96 360

macro avg 0.97 0.96 0.96 360

weighted avg 0.97 0.96 0.96 360

EXPERIMENT NO: 8

Aim: To study Naive Bayes using Machine Learning

Theory:

Naive Bayes is a simple yet powerful algorithm for classification based on


Bayes' Theorem. It assumes that the features in the dataset are independent of
each other, hence the term "naive." Despite this simplification, it often
performs well in many real-world applications, particularly in text
classification problems.

Bayes' Theorem
The foundation of Naive Bayes lies in Bayes' Theorem, which helps calculate
the probability of a hypothesis (label) given some evidence (features). The
formula for Bayes' Theorem is:

[ P(H|E) = (P(E|H) * P(H)) / P(E) ]

Where:

● P(H|E) is the posterior probability, the probability of the hypothesis


(class) (H) being true given the evidence (features) (E).
● P(E|H) is the likelihood, the probability of observing the evidence (E)
given the hypothesis (H).
● P(H) is the prior probability of the hypothesis, representing how
common the hypothesis is.
● P(E) is the marginal likelihood, the total probability of the evidence.

The Naive Assumption


The algorithm assumes that all features are independent of each other. This
simplifies the calculations, as the joint probability (P(E|H)) can be broken
down into the product of individual probabilities:

[ P(E|H) = P(e1|H) * P(e2|H) * ... * P(en|H) ]

This independence assumption is rarely true in practice, but Naive Bayes still
works well in many cases due to its simplicity and efficiency.

Types of Naive Bayes Classifiers

1. Gaussian Naive Bayes: Assumes that the features follow a normal


(Gaussian) distribution, often used when dealing with continuous data.
2. Multinomial Naive Bayes: Works well for discrete data, especially for
text classification where the features are word counts or frequencies.

3. Bernoulli Naive Bayes: Suitable for binary/boolean data, often used
when the features are represented as binary values (e.g., the presence or
absence of a word in text classification).
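The program below uses GaussianNB on the Iris measurements. As a small side sketch (with a made-up word-count matrix, not part of the original listing), MultinomialNB and BernoulliNB follow the same fit/predict interface:

import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Hypothetical word-count features for four short documents (0 = ham, 1 = spam)
X_counts = np.array([[2, 0, 1], [0, 3, 0], [1, 0, 2], [0, 2, 1]])
y = np.array([0, 1, 0, 1])

print(MultinomialNB().fit(X_counts, y).predict([[1, 0, 1]]))
print(BernoulliNB().fit((X_counts > 0).astype(int), y).predict([[1, 0, 1]]))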

Applications
Naive Bayes is commonly used in:

● Spam filtering: Classifying emails as spam or not based on the


occurrence of specific words.
● Sentiment analysis: Determining whether a given text has a positive or
negative sentiment.
● Document classification: Categorizing documents into predefined
categories.

Program:
In [4]: from sklearn.naive_bayes import GaussianNB

from sklearn.naive_bayes import MultinomialNB

from sklearn import datasets

from sklearn.metrics import confusion_matrix

from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

x = iris.data

y = iris.target

x_train, x_test, y_train, y_test = train_test_split(x, y,


test_size=0.3, random_state=0)

gnb = GaussianNB()

mnb = MultinomialNB()

y_pred_gnb = gnb.fit(x_train, y_train).predict(x_test)

cnf_matrix_gnb = confusion_matrix(y_test, y_pred_gnb)

cnf_matrix_gnb

Out[4]: array([[16, 0, 0],

[ 0, 18, 0],

[ 0, 0, 11]])
In [5]: from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred_gnb))

In [6]: ans = gnb.predict([[5, 3, 1.2, 2]])

ans

Out[6]: array([1])
In [7]: from sklearn.datasets import load_iris

import pandas as pd

iris = load_iris()

df = pd.DataFrame(iris.data, columns=iris.feature_names)

df['target'] = iris.target

X = iris.data

df.sample(4)

In [9]: df['species'] = pd.Categorical.from_codes(iris.target,
iris.target_names)
df.head()

In [10]: df['species'] = pd.Categorical.from_codes(iris.target,


iris.target_names)

df.tail()

EXPERIMENT NO:9

Aim: Introduction to and study of the K-means algorithm

Theory:

Unsupervised Machine Learning is the process of teaching a computer to use
unlabeled, unclassified data and enabling the algorithm to operate on that
data without supervision. Without any previous training on the data, the
machine's job in this case is to organize unsorted data according to
similarities, patterns, and variations.

K-means clustering assigns data points to one of K clusters depending on
their distance from the centres of the clusters. It starts by randomly placing
the cluster centroids in the space. Each data point is then assigned to one of
the clusters based on its distance from the cluster centroids. After assigning
each point to a cluster, new cluster centroids are computed. This process runs
iteratively until it finds good clusters. In this analysis we assume that the
number of clusters is given in advance and we have to put each point into one
of the groups.

In some cases, K is not clearly defined, and we have to think about the optimal
number of clusters K. K-means clustering performs best when the data is well
separated; when data points overlap, this clustering is not suitable. K-means is
faster compared to other clustering techniques and provides strong coupling
between the data points, but it does not provide clear information regarding
the quality of the clusters. Different initial assignments of the cluster
centroids may lead to different clusters. Also, the K-means algorithm is
sensitive to noise and may get stuck in local minima.

What is the objective of k-means clustering?

The goal of clustering is to divide the population or set of data points into a
number of groups so that the data points within each group are more
comparable to one another and different from the data points within the other
groups. It is essentially a grouping of things based on how similar and
different they are to one another.

How k-means clustering works?


We are given a data set of items, with certain features, and values for these
features (like a vector). The task is to categorize those items into groups. To
achieve this, we will use the K-means algorithm, an unsupervised learning
algorithm. ‘K’ in the name of the algorithm represents the number of
groups/clusters we want to classify our items into.
(It will help if you think of the items as points in an n-dimensional space.) The
algorithm will categorize the items into k groups or clusters of similarity. To
calculate that similarity, we will use the Euclidean distance as a measurement.
The algorithm works as follows:

First, we randomly initialize k points, called means or cluster centroids. We


categorize each item to its closest mean, and we update the mean’s
coordinates, which are the averages of the items categorized in that cluster so
far.
We repeat the process for a given number of iterations and at the end, we have
our clusters.
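Since K is sometimes not known in advance, a common heuristic is the elbow method: fit K-means for several values of K and look for the point where the within-cluster sum of squares (inertia) stops dropping sharply. The sketch below (not part of the original listing) applies it to the same ten points used in the program:

import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

x = np.array([[5, 3], [10, 15], [15, 12], [24, 10], [30, 45],
              [85, 70], [71, 80], [60, 78], [55, 52], [80, 91]])

inertias = []
for k in range(1, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(x)
    inertias.append(km.inertia_)   # within-cluster sum of squares for this K

plt.plot(range(1, 6), inertias, marker='o')
plt.xlabel('Number of clusters K')
plt.ylabel('Inertia')
plt.title('Elbow method')
plt.show()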

Program:
In [ ]: import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
x = np.array([[5,3], [10,15], [15,12], [24,10], [30,45],
[85,70], [71,80], [60,78], [55,52], [80,91]])
kmeans = KMeans(n_clusters=2, random_state=0).fit(x)
print(kmeans.labels_)
print(kmeans.cluster_centers_)
plt.scatter(x[:,0], x[:,1], label = 'trueposition')

[1 1 1 1 1 0 0 0 0 0]
[[70.2 74.2]
 [16.8 17. ]]

Out[ ]:
<matplotlib.collections.PathCollection at 0x78428cb42080>

In [ ]: kmeans = KMeans(n_clusters=2)
kmeans.fit(x)
print(kmeans.cluster_centers_)

[[70.2 74.2]
[16.8 17. ]]

In [ ]: print(kmeans.labels_)

Out[ ]: [1 1 1 1 1 0 0 0 0 0]

In [ ]: plt.scatter(x[:,0], x[:,1], c=kmeans.labels_,


cmap='rainbow')

Out[ ]:
<matplotlib.collections.PathCollection at 0x7842876e5e70>

In [ ]: plt.scatter(x[:,0], x[:,1], c=kmeans.labels_,


cmap='rainbow')
plt.scatter(kmeans.cluster_centers_[:,0],
kmeans.cluster_centers_[:,1], color='black')

Out[ ]:
<matplotlib.collections.PathCollection at 0x784287a62d10>

In [2]: import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn import datasets
iris = datasets.load_iris()
x = iris.data
plt.scatter(x[:,0], x[:,1], label = 'TruePosition')

Out[2]:
<matplotlib.collections.PathCollection at 0x7b286e9dd720>

In [3]: kmeans = KMeans(n_clusters=2)
kmeans.fit(x)
print(kmeans.cluster_centers_)
Out[ ]: [[6.30103093 2.88659794 4.95876289 1.69587629]
[5.00566038 3.36981132 1.56037736 0.29056604]]

In [5]: print(kmeans.labels_)

Out[ ]:
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1
1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0
0 0]

In [6]: plt.scatter(x[:,0], x[:,1], c=kmeans.labels_,


cmap='rainbow')

Out[6]:
<matplotlib.collections.PathCollection at 0x7b2868a699f0>

In [7]: kmeans = KMeans(n_clusters=3)
kmeans.fit(x)
print(kmeans.cluster_centers_)
Out[ ]: [[6.85384615 3.07692308 5.71538462 2.05384615]
[5.006 3.428 1.462 0.246 ]
[5.88360656 2.74098361 4.38852459 1.43442623]]

In [8]: print(kmeans.labels_)

[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1
1 1 1 1 1 1 1 1 1 1 1 1 1 0 2 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2
2 2 2 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 0 2 0 0 0 0 2 0 0 0
0
0 0 2 2 0 0 0 0 2 0 2 0 2 0 0 2 2 0 0 0 0 0 2 0 0 0 0 2 0 0 0 2 0 0 0 2
0
0 2]

EXPERIMENT NO 10:

Aim: To study and perform Principal Component Analysis

Theory:
As the number of features or dimensions in a dataset increases, the amount of
data required to obtain a statistically significant result increases
exponentially. This can lead to issues such as overfitting, increased
computation time, and reduced accuracy of machine learning models. This is
known as the curse of dimensionality, a problem that arises while working with
high-dimensional data.

As the number of dimensions increases, the number of possible combinations


of features increases exponentially, which makes it computationally difficult
to obtain a representative sample of the data. It becomes expensive to perform
tasks such as clustering or classification because the algorithms need to
process a much larger feature space, which increases computation time and
complexity. Additionally, some machine learning algorithms can be sensitive
to the number of dimensions, requiring more data to achieve the same level of
accuracy as lower-dimensional data.

To address the curse of dimensionality, feature engineering techniques are
used, which include feature selection and feature extraction. Dimensionality
reduction is a type of feature extraction technique that aims to reduce the
number of input features while retaining as much of the original information
as possible.

In this experiment, we will discuss one of the most popular dimensionality
reduction techniques, i.e. Principal Component Analysis (PCA).

Program:

Step 1: Import the necessary libraries


In [17]: import numpy as np
from numpy import linalg as la

Step 2: Give the input dataset.


In [2]: x = np.array([2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2, 1, 1.5, 1.1])
y = np.array([2.4, 0.7, 2.9, 2.2, 3, 2.7, 1.6, 1.1, 1.6, 0.9])
data = np.array([x, y])
print(x)
print(y)

print(data)

[2.5 0.5 2.2 1.9 3.1 2.3 2. 1. 1.5 1.1]


[2.4 0.7 2.9 2.2 3. 2.7 1.6 1.1 1.6 0.9]
[[2.5 0.5 2.2 1.9 3.1 2.3 2. 1. 1.5 1.1]
[2.4 0.7 2.9 2.2 3. 2.7 1.6 1.1 1.6 0.9]]

In [3]: xMean = np.mean(x)


yMean = np.mean(y)
print(xMean)
print(yMean)

Out [3]: 1.81


1.9100000000000001

In [4]: data.shape

Out[4]: (2, 10)


Step 3: Compute the mean-adjusted values by subtracting the mean from each point.
In [5]: meanAdjusted = np.zeros((2, 10))
for i in range(len(data[0])):
meanAdjusted[0][i] = data[0][i] - xMean
for i in range(len(data[1])):
meanAdjusted[1][i] = data[1][i] - yMean
print(meanAdjusted)

Out [5]:[[ 0.69 -1.31 0.39 0.09 1.29 0.49 0.19 -0.81 -0.31 -0.71]
[ 0.49 -1.21 0.99 0.29 1.09 0.79 -0.31 -0.81 -0.31 -1.01]]

Step 4: Compute the covariance matrix of the mean adjusted data


In [6]: cov_mat = np.cov(meanAdjusted)
print(cov_mat)

Out [6]: [[0.61655556 0.61544444]


[0.61544444 0.71655556]]

Step 5: Compute the eigenvalues and eigenvectors.


In [7]: eig_vals, eig_vecs = np.linalg.eig(cov_mat)
print('Eigenvectors \n%s' %eig_vecs)
print('\nEigenvalues \n%s' %eig_vals)

Out [7]: Eigenvectors
[[-0.73517866 -0.6778734 ]
[ 0.6778734 -0.73517866]]
Eigenvalues
[0.0490834 1.28402771]

Step 6: Arrange the eigenvalues in descending order.


In [8]: eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:,i]) for i in
range(len(eig_vals))]
eig_pairs.sort()
eig_pairs.reverse()
print('Eigenvalues in descending order:')
for i in eig_pairs:
print(i[0])

Out [8]: Eigenvalues in descending order:


1.2840277121727839
0.04908339893832736

In [10]: print('Eigenvectors in descending order: ')


for i in eig_pairs:
print(i[1])

Out [10]: Eigenvectors in descending order:


[-0.6778734 -0.73517866]
[-0.73517866 0.6778734 ]

In [11]: eig_pairs [0][1]

Out[11]: array([-0.6778734 , -0.73517866])

Step 7: Retain only the eigenvectors with the largest eigenvalues, then transform the data and display it.

In [12]: transformedData1 = np.matmul (meanAdjusted.T, eig_pairs[0][1])


transformedData2 = np.matmul (meanAdjusted.T, eig_pairs[1][1])
print(transformedData1)
print(transformedData2)

Out [12]: [-0.82797019 1.77758033 -0.99219749 -0.27421042 -1.67580142


-0.9129491 0.09910944 1.14457216 0.43804614 1.22382056]
[-0.17511531 0.14285723 0.38437499 0.13041721 -0.20949846
0.17528244 -0.3498247 0.04641726 0.01776463 -0.16267529]

In [13]: transformedData = [transformedData1, transformedData2]
transformedData = np.transpose(transformedData)
print(transformedData)

Out [13]:
[[-0.82797019 -0.17511531]
[ 1.77758033 0.14285723]
[-0.99219749 0.38437499]
[-0.27421042 0.13041721]
[-1.67580142 -0.20949846]
[-0.9129491 0.17528244]
[ 0.09910944 -0.3498247 ]
[ 1.14457216 0.04641726]
[ 0.43804614 0.01776463]
[ 1.22382056 -0.16267529]]

In [14]: matrix_w = np.hstack((eig_pairs[0][1].reshape(2,1),


eig_pairs[1][1].reshape(2,1)))
print('Matrix W:\n', matrix_w)

Out [14]: Matrix W:


[[-0.6778734 -0.73517866]
[-0.73517866 0.6778734 ]]

Step 8: Reconstruct and transform the original data.

In [16]: originalData = np.matmul(transformedData, matrix_w)


originalData[:][:] = originalData[:][:] + np.array([xMean, yMean])
print(originalData)

Out [16]:
[[2.5 2.4]
[0.5 0.7]
[2.2 2.9]
[1.9 2.2]
[3.1 3. ]
[2.3 2.7]
[2. 1.6]
[1. 1.1]
[1.5 1.6]
[1.1 0.9]]
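For comparison (a sketch, not part of the original listing), the same decomposition can be obtained with scikit-learn's PCA class; the signs of the components may differ from the hand-computed eigenvectors, which does not change the subspace.

import numpy as np
from sklearn.decomposition import PCA

x = np.array([2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2, 1, 1.5, 1.1])
y = np.array([2.4, 0.7, 2.9, 2.2, 3, 2.7, 1.6, 1.1, 1.6, 0.9])
data = np.column_stack([x, y])          # samples as rows, features as columns

pca = PCA(n_components=2)
transformed = pca.fit_transform(data)   # PCA centres the data internally
print(pca.components_)                  # principal directions (eigenvectors)
print(pca.explained_variance_)          # corresponding eigenvalues
print(transformed)                      # data projected onto the components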

EXPERIMENT NO: 11

Aim: To compute covariance matrices and apply Principal Component Analysis to the Iris dataset.

In [1]: import numpy as np


import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
import pylab as pl

In [7]: x1 = np.arange(0, 10)


y1 = np.arange(10, 0, -1)

In [8]: plt.plot(x1,y1)

Out[8]:
[<matplotlib.lines.Line2D at 0x7f1efc951060>]

In [9]: np.cov([x1,y1])

Out[9]:
array([[ 9.16666667, -9.16666667],
[-9.16666667, 9.16666667]])

In [10]: x2 = np.arange(0,10)

y2 = np.array([2]*10)
plt.plot(x2,y2)

Out[10]:
[<matplotlib.lines.Line2D at 0x7f1efc7694b0>]

In [11]: cov_mat = np.cov([x2,y2])


cov_mat
Out[11]:
array([[9.16666667, 0. ],
[0. , 0. ]])

In [12]: x3 = np.array([2]*10)
y3 = np.arange(0,10)
plt.plot(x3,y3)

Out[12]:
[<matplotlib.lines.Line2D at 0x7f1efc5d7640>]

In [13]: np.cov([x3,y3])

Out[13]:
array([[0. , 0. ],
[0. , 9.16666667]])

In [14]: iris = load_iris()

In [15]: iris_df = pd.DataFrame(iris.data,columns=[iris.feature_names])


iris_df.head()

Out[15]:

In [16]: X = iris.data
X.shape

Out[16]: (150, 4)

In [18]: from sklearn.preprocessing import StandardScaler


X_std = StandardScaler().fit_transform(X)
print(X_std[0:5])
print("The shape of Feature Matrix is -",X_std.shape)
Out[18]:

[[-0.90068117 1.01900435 -1.34022653 -1.3154443 ]


[-1.14301691 -0.13197948 -1.34022653 -1.3154443 ]
[-1.38535265 0.32841405 -1.39706395 -1.3154443 ]
[-1.50652052 0.09821729 -1.2833891 -1.3154443 ]
[-1.02184904 1.24920112 -1.34022653 -1.3154443 ]]
The shape of Feature Matrix is - (150, 4)

In [19]: X_covariance_matrix = np.cov(X_std.T)


X_covariance_matrix

Out[19]:
array([[ 1.00671141, -0.11835884, 0.87760447, 0.82343066],
[-0.11835884, 1.00671141, -0.43131554, -0.36858315],
[ 0.87760447, -0.43131554, 1.00671141, 0.96932762],
[ 0.82343066, -0.36858315, 0.96932762, 1.00671141]])

In [20]: eig_vals, eig_vecs = np.linalg.eig(X_covariance_matrix)


print('Eigenvectors \n%s' %eig_vecs)
print('\nEigenvalues \n%s' %eig_vals)

Eigenvectors
[[ 0.52106591 -0.37741762 -0.71956635 0.26128628]
[-0.26934744 -0.92329566 0.24438178 -0.12350962]
[ 0.5804131 -0.02449161 0.14212637 -0.80144925]
[ 0.56485654 -0.06694199 0.63427274 0.52359713]]

Eigenvalues
[2.93808505 0.9201649 0.14774182 0.02085386]

In [21]: eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:,i]) for i in


range(len(eig_vals))]

# Sort the (eigenvalue, eigenvector) tuples from high to low


eig_pairs.sort(key=lambda x: x[0], reverse=True)

print('Eigenvalues in descending order:')
for i in eig_pairs:
print(i[0])

Out[9]:
Eigenvalues in descending order:
2.938085050199995
0.9201649041624864
0.1477418210449475
0.020853862176462696

In [23]: tot = sum(eig_vals)


var_exp = [(i / tot)*100 for i in sorted(eig_vals, reverse=True)]
cum_var_exp = np.cumsum(var_exp)
print("Variance captured by each component is \n",var_exp)
print(40 * '-')
print("Cumulative variance captured as we travel each component
\n",cum_var_exp)

Variance captured by each component is


[72.96244541329989, 22.850761786701753, 3.668921889282865, 0.5178709107154905]
----------------------------------------
Cumulative variance captured as we travel each component
[ 72.96244541 95.8132072 99.48212909 100. ]

In [24]: matrix_w = np.hstack((eig_pairs[0][1].reshape(4,1),


eig_pairs[1][1].reshape(4,1)))
print ('Matrix W:\n', matrix_w)

Out[24]:
Matrix W:
[[ 0.52106591 -0.37741762]
[-0.26934744 -0.92329566]
[ 0.5804131 -0.02449161]
[ 0.56485654 -0.06694199]]

In [25]: Y = X_std.dot(matrix_w)
print (Y[0:5])

Out[25]:
[[-2.26470281 -0.4800266 ]
[-2.08096115 0.67413356]
[-2.36422905 0.34190802]
[-2.29938422 0.59739451]
[-2.38984217 -0.64683538]]

In [28]: pl.figure()
target_names = iris.target_names
y = iris.target
for c, i, target_name in zip("rgb", [0, 1, 2], target_names):
pl.scatter(Y[y==i,0], Y[y==i,1], c=c, label=target_name)
pl.xlabel('Principal Component 1')
pl.ylabel('Principal Component 2')
pl.legend()
pl.title('PCA of IRIS dataset')
pl.show()

Out[28]:
