0% found this document useful (0 votes)
20 views95 pages

9 .ML Programs

Uploaded by

abhiramragu51
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
20 views95 pages

9 .ML Programs

Uploaded by

abhiramragu51
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 95

12/9/22, 7:45 PM IRIS FLOWER DATASET(EDA)

AIM: EXPERIMENT-1
Plotting for Exploratory Data Analysis
(EDA)

Description:
Exploratory data analysis was promoted by John Tukey to encourage statisticians to explore
data, and possibly formulate hypotheses that might cause new data collection and experiments.
EDA focuses more narrowly on checking assumptions required for model fitting and hypothesis
testing. It also checks while handling missing values and making transformations of variables as
needed.filling the counts with

EDA build a robust understanding of the data, issues associated with either the info or process.
it’s a scientific approach to get the story of the data.

ALGORITHM:
1.Load.csv files

2.Dataset Information

3.Data Cleaning/Wrangling

4.Group by names

5.Summary of Statistics

6.Dealing with Missing Values

7.Skewness and kurtosis

8.Categorical variable Move

9.Create Dummy Variables

10.Removing Columns

11.Univariate Analysis

12.Bivariate Analysis

13.Multi-Variate Analysis

14.Distributions of the variables/features

15.Correlation

In [10]:
import pandas as pd
import seaborn as sns
localhost:8888/nbconvert/html/Desktop/m l record/IRIS FLOWER DATASET(EDA).ipynb?download=false 1/21
12/9/22, 7:45 PM IRIS FLOWER DATASET(EDA)

import matplotlib.pyplot as plt


import numpy as np
import os
os.getcwd()

Out[10]: 'C:\\Users\\STAFF'

In [11]:
iris=pd.read_csv("C:\\Users\\STAFF\\Desktop\\New folder\\iris.csv")

In [12]:
print(iris.shape)
print(iris.head())

(150, 5)
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa

In [13]:
print(iris.columns)

Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',


'species'],
dtype='object')

In [14]:
iris["species"].value_counts()

Out[14]: setosa 50
virginica 50
versicolor 50
Name: species, dtype: int64

In [15]:
iris.plot(kind='scatter',x='sepal_length',y='sepal_width')
plt.show()

Observation
1.

localhost:8888/nbconvert/html/Desktop/m l record/IRIS FLOWER DATASET(EDA).ipynb?download=false 2/21


12/9/22, 7:45 PM IRIS FLOWER DATASET(EDA)

In [16]:
sns.set_style("whitegrid")
sns.FacetGrid(iris,hue="species",height=4).map(plt.scatter,"sepal_length","sepal_wid
plt.show()

Observation
1. From the above graph we can differentiate the 3 species of iris namely setosa, versicolor
and virginica.
2. These 3 types of iris can be differentiated on the basis of their sepal_length and
sepal_width.
3. In the above graph setosa can be entirely differentiated from versicolor and virginica.
4.

In [17]:
plt.close()
sns.set_style("whitegrid")
sns.pairplot(iris,hue="species",height=3)
plt.show()

localhost:8888/nbconvert/html/Desktop/m l record/IRIS FLOWER DATASET(EDA).ipynb?download=false 3/21


12/9/22, 7:45 PM IRIS FLOWER DATASET(EDA)

Obsevation
1. By using this graph we can easily classify setosa from virginica and versicolor.
2. From above graphs it is clear that petal_length and petal_width are 2 important plots.
3. For setosa petal_length is less than 2.
4. For setosa petal_width is 1. 5.

In [18]:
iris=pd.read_csv("C:\\Users\\STAFF\\Desktop\\New folder\\iris.csv")
iris_setosa = iris.loc[iris["species"] == "setosa"];
iris_virginica = iris.loc[iris["species"] == "virginica"];
iris_versicolor = iris.loc[iris["species"] == "versicolor"];
plt.plot(iris_setosa["petal_length"],np.zeros_like(iris_setosa['petal_length']),'o')
plt.plot(iris_virginica["petal_length"],np.zeros_like(iris_virginica['petal_length']
plt.plot(iris_versicolor["petal_length"],np.zeros_like(iris_versicolor['petal_length

Out[18]: [<matplotlib.lines.Line2D at 0x1cacbd72340>]

localhost:8888/nbconvert/html/Desktop/m l record/IRIS FLOWER DATASET(EDA).ipynb?download=false 4/21


12/9/22, 7:45 PM IRIS FLOWER DATASET(EDA)

In [19]:
sns.FacetGrid(iris,hue="species",height=5)\
.map(sns.histplot,"petal_length")\
.add_legend();
plt.show();

In [20]:
sns.FacetGrid(iris,hue="species",height=5)\
.map(sns.histplot,"sepal_width")\
.add_legend();
plt.show();

localhost:8888/nbconvert/html/Desktop/m l record/IRIS FLOWER DATASET(EDA).ipynb?download=false 5/21


12/9/22, 7:45 PM IRIS FLOWER DATASET(EDA)

In [21]:
counts, bin_edges = np.histogram(iris_setosa['petal_length'], bins=10,
density = True)
pdf = counts/(sum(counts))
print(pdf);
print(bin_edges);
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf);
plt.plot(bin_edges[1:], cdf)
counts, bin_edges = np.histogram(iris_setosa['petal_length'], bins=20,
density = True)
pdf = counts/(sum(counts))
plt.plot(bin_edges[1:],pdf);
plt.show();

[0.02 0.02 0.04 0.14 0.24 0.28 0.14 0.08 0. 0.04]


[1. 1.09 1.18 1.27 1.36 1.45 1.54 1.63 1.72 1.81 1.9 ]

In [22]:
counts, bin_edges = np.histogram(iris_setosa['petal_length'], bins=10,
density = True)
pdf = counts/(sum(counts))
print(pdf);
print(bin_edges)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf)

localhost:8888/nbconvert/html/Desktop/m l record/IRIS FLOWER DATASET(EDA).ipynb?download=false 6/21


12/9/22, 7:45 PM IRIS FLOWER DATASET(EDA)

plt.plot(bin_edges[1:], cdf)
plt.show();
[0.02 0.02 0.04 0.14 0.24 0.28 0.14 0.08 0. 0.04]
[1. 1.09 1.18 1.27 1.36 1.45 1.54 1.63 1.72 1.81 1.9 ]

In [23]:
counts, bin_edges = np.histogram(iris_setosa['petal_length'], bins=10,
density = True)
pdf = counts/(sum(counts))
print(pdf);
print(bin_edges)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf)
plt.plot(bin_edges[1:], cdf)
counts, bin_edges = np.histogram(iris_virginica['petal_length'], bins=10,
density = True)
pdf = counts/(sum(counts))
print(pdf);
print(bin_edges)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf)
plt.plot(bin_edges[1:], cdf)
counts, bin_edges = np.histogram(iris_versicolor['petal_length'], bins=10,
density = True)
pdf = counts/(sum(counts))
print(pdf);
print(bin_edges)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf)
plt.plot(bin_edges[1:], cdf)
plt.show();

[0.02 0.02 0.04 0.14 0.24 0.28 0.14 0.08 0. 0.04]


[1. 1.09 1.18 1.27 1.36 1.45 1.54 1.63 1.72 1.81 1.9 ]
[0.02 0.1 0.24 0.08 0.18 0.16 0.1 0.04 0.02 0.06]
[4.5 4.74 4.98 5.22 5.46 5.7 5.94 6.18 6.42 6.66 6.9 ]
[0.02 0.04 0.06 0.04 0.16 0.14 0.12 0.2 0.14 0.08]
[3. 3.21 3.42 3.63 3.84 4.05 4.26 4.47 4.68 4.89 5.1 ]

localhost:8888/nbconvert/html/Desktop/m l record/IRIS FLOWER DATASET(EDA).ipynb?download=false 7/21


12/9/22, 7:45 PM IRIS FLOWER DATASET(EDA)

In [24]:
print("Means:")
print(np.mean(iris_setosa["petal_length"]))
print(np.mean(np.append(iris_setosa["petal_length"],50)));
print(np.mean(iris_virginica["petal_length"]))
print(np.mean(iris_versicolor["petal_length"]))
print("\nStd-dev:");
print(np.std(iris_setosa["petal_length"]))
print(np.std(iris_virginica["petal_length"]))
print(np.std(iris_versicolor["petal_length"]))

Means:
1.464
2.4156862745098038
5.552
4.26

Std-dev:
0.17176728442867115
0.5463478745268441
0.4651881339845204

In [25]:
print("\nMedians:")
print(np.median(iris_setosa["petal_length"]))
print(np.median(np.append(iris_setosa["petal_length"],50)));
print(np.median(iris_virginica["petal_length"]))
print(np.median(iris_versicolor["petal_length"]))
print("\nQuantiles:")
print(np.percentile(iris_setosa["petal_length"],np.arange(0, 100, 25)))
print(np.percentile(iris_virginica["petal_length"],np.arange(0, 100, 25)))
print(np.percentile(iris_versicolor["petal_length"], np.arange(0, 100, 25)))
print("\n90th Percentiles:")
print(np.percentile(iris_setosa["petal_length"],90))
print(np.percentile(iris_virginica["petal_length"],90))
print(np.percentile(iris_versicolor["petal_length"], 90))
from statsmodels import robust
print ("\nMedian Absolute Deviation")
print(robust.mad(iris_setosa["petal_length"]))
print(robust.mad(iris_virginica["petal_length"]))
print(robust.mad(iris_versicolor["petal_length"]))

Medians:
1.5
1.5
5.55
4.35

Quantiles:
[1. 1.4 1.5 1.575]
localhost:8888/nbconvert/html/Desktop/m l record/IRIS FLOWER DATASET(EDA).ipynb?download=false 8/21
12/9/22, 7:45 PM IRIS FLOWER DATASET(EDA)

[4.5 5.1 5.55 5.875]


[3. 4. 4.35 4.6 ]

90th Percentiles:
1.7
6.31
4.8

Median Absolute Deviation


0.14826022185056031
0.6671709983275211
0.5189107764769602

In [26]:
sns.boxplot(x='species',y='petal_length', data=iris)
plt.show()

In [57]:
sns.violinplot(x="species",y="sepal_width",data=iris,size=8)
plt.show()

In [58]:
sns.violinplot(x="species",y="petal_length",data=iris,size=8)
plt.show()

localhost:8888/nbconvert/html/Desktop/m l record/IRIS FLOWER DATASET(EDA).ipynb?download=false 9/21


12/9/22, 7:45 PM IRIS FLOWER DATASET(EDA)

In [59]:
sns.violinplot(x="species",y="sepal_length",data=iris,size=8)
plt.show()

In [51]:
sns.violinplot(x="species",y="petal_width",data=iris,size=8)
plt.show()

In [52]:
sns.jointplot(x="petal_length",y="petal_width",data=iris_setosa,kind="kde")
plt.show()

localhost:8888/nbconvert/html/Desktop/m l record/IRIS FLOWER DATASET(EDA).ipynb?download=false 10/21


12/9/22, 7:45 PM IRIS FLOWER DATASET(EDA)

In [55]:
sns.jointplot(x="sepal_length",y="sepal_width",data=iris_versicolor,kind="kde")
plt.show()

In [56]:
sns.jointplot(x="petal_length",y="sepal_width",data=iris_virginica,kind="kde")
plt.show()

localhost:8888/nbconvert/html/Desktop/m l record/IRIS FLOWER DATASET(EDA).ipynb?download=false 11/21


12/9/22, 7:45 PM IRIS FLOWER DATASET(EDA)

In [45]:
sns.jointplot(x="sepal_length",y="petal_width",data=iris_setosa,kind="kde")
plt.show()

In [31]:
iris_virginica_SW=iris_virginica.iloc[:,1]
iris_versicolor_SW=iris_versicolor.iloc[:,1]

localhost:8888/nbconvert/html/Desktop/m l record/IRIS FLOWER DATASET(EDA).ipynb?download=false 12/21


12/9/22, 7:45 PM IRIS FLOWER DATASET(EDA)

In [32]:
from scipy import stats
stats.ks_2samp(iris_virginica_SW,iris_versicolor_SW)

Out[32]: KstestResult(statistic=0.26, pvalue=0.06779471096995852)

In [33]:
x=stats.norm.rvs(loc=0.2,size=10)
stats.kstest(x,'norm')

Out[33]: KstestResult(statistic=0.4494357377033551, pvalue=0.02315422931280886)

In [34]:
x=stats.norm.rvs(loc=0.2,size=100)
stats.kstest(x,'norm')

Out[34]: KstestResult(statistic=0.1518997609250119, pvalue=0.017650084062803217)

In [35]:
x=stats.norm.rvs(loc=0.2,size=1000)
stats.kstest(x,'norm')

Out[35]: KstestResult(statistic=0.09365118098235525, pvalue=4.3916646230907635e-08)

In [21]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [22]:
dataset=pd.read_csv("C:\\Users\\STAFF\\Desktop\\20981a4409\\Data1.csv")
dataset

Out[22]: Name Gender Age Salary

0 James male 30.0 20000.0

1 Kumar male 50.0 NaN

2 Laxman NaN NaN NaN

3 Rani female 25.0 NaN

4 Prem male NaN 50000.0

In [23]:
print(dataset)
dataset.isnull()

Name Gender Age Salary


0 James male 30.0 20000.0
1 Kumar male 50.0 NaN
2 Laxman NaN NaN NaN
3 Rani female 25.0 NaN
4 Prem male NaN 50000.0
Out[23]: Name Gender Age Salary

0 False False False False

1 False False False True

2 False True True True

localhost:8888/nbconvert/html/Desktop/m l record/IRIS FLOWER DATASET(EDA).ipynb?download=false 13/21


12/9/22, 7:45 PM IRIS FLOWER DATASET(EDA)

Name Gender Age Salary

3 False False False True

4 False False True False

In [24]:
x=dataset.iloc[:, :-1].values
x

Out[24]: array([['James', 'male', 30.0],


['Kumar', 'male', 50.0],
['Laxman', nan, nan],
['Rani', 'female', 25.0],
['Prem ', 'male', nan]], dtype=object)

In [25]:
print(x)

[['James' 'male' 30.0]


['Kumar' 'male' 50.0]
['Laxman' nan nan]
['Rani' 'female' 25.0]
['Prem ' 'male' nan]]

In [26]:
y=dataset.iloc[:,3].values

In [27]:
print(y)

[20000. nan nan nan 50000.]

In [28]:
print(x)

[['James' 'male' 30.0]


['Kumar' 'male' 50.0]
['Laxman' nan nan]
['Rani' 'female' 25.0]
['Prem ' 'male' nan]]

In [29]:
dataset

Out[29]: Name Gender Age Salary

0 James male 30.0 20000.0

1 Kumar male 50.0 NaN

2 Laxman NaN NaN NaN

3 Rani female 25.0 NaN

4 Prem male NaN 50000.0

In [30]:
dataset[dataset.notnull()]

Out[30]: Name Gender Age Salary

0 James male 30.0 20000.0

1 Kumar male 50.0 NaN

localhost:8888/nbconvert/html/Desktop/m l record/IRIS FLOWER DATASET(EDA).ipynb?download=false 14/21


12/9/22, 7:45 PM IRIS FLOWER DATASET(EDA)

Name Gender Age Salary

2 Laxman NaN NaN NaN

3 Rani female 25.0 NaN

4 Prem male NaN 50000.0

In [31]:
dataset.isnull().sum()

Out[31]: Name 0
Gender 1
Age 2
Salary 3
dtype: int64

In [32]:
dataset.shape

Out[32]: (5, 4)

In [33]:
df1=dataset.dropna()
df1

Out[33]: Name Gender Age Salary

0 James male 30.0 20000.0

In [34]:
df1.shape

Out[34]: (1, 4)

In [35]:
df2=dataset.dropna(axis='columns')
df2

Out[35]: Name

0 James

1 Kumar

2 Laxman

3 Rani

4 Prem

In [36]:
dataset

Out[36]: Name Gender Age Salary

0 James male 30.0 20000.0

1 Kumar male 50.0 NaN

2 Laxman NaN NaN NaN

3 Rani female 25.0 NaN

localhost:8888/nbconvert/html/Desktop/m l record/IRIS FLOWER DATASET(EDA).ipynb?download=false 15/21


12/9/22, 7:45 PM IRIS FLOWER DATASET(EDA)

Name Gender Age Salary

4 Prem male NaN 50000.0

In [37]:
df4=dataset.dropna(axis='columns',how='all')
df4

Out[37]: Name Gender Age Salary

0 James male 30.0 20000.0

1 Kumar male 50.0 NaN

2 Laxman NaN NaN NaN

3 Rani female 25.0 NaN

4 Prem male NaN 50000.0

In [38]:
print(dataset)
df5=dataset.dropna(axis='rows',thresh=3)
df5

Name Gender Age Salary


0 James male 30.0 20000.0
1 Kumar male 50.0 NaN
2 Laxman NaN NaN NaN
3 Rani female 25.0 NaN
4 Prem male NaN 50000.0
Out[38]: Name Gender Age Salary

0 James male 30.0 20000.0

1 Kumar male 50.0 NaN

3 Rani female 25.0 NaN

4 Prem male NaN 50000.0

In [39]:
dataset

Out[39]: Name Gender Age Salary

0 James male 30.0 20000.0

1 Kumar male 50.0 NaN

2 Laxman NaN NaN NaN

3 Rani female 25.0 NaN

4 Prem male NaN 50000.0

In [40]:
df6=dataset.fillna(0)
df6

Out[40]: Name Gender Age Salary

0 James male 30.0 20000.0

localhost:8888/nbconvert/html/Desktop/m l record/IRIS FLOWER DATASET(EDA).ipynb?download=false 16/21


12/9/22, 7:45 PM IRIS FLOWER DATASET(EDA)

Name Gender Age Salary

1 Kumar male 50.0 0.0

2 Laxman 0 0.0 0.0

3 Rani female 25.0 0.0

4 Prem male 0.0 50000.0

In [41]:
df7=dataset.fillna(1)
df7

Out[41]: Name Gender Age Salary

0 James male 30.0 20000.0

1 Kumar male 50.0 1.0

2 Laxman 1 1.0 1.0

3 Rani female 25.0 1.0

4 Prem male 1.0 50000.0

In [42]:
print(dataset)
df8=dataset.fillna(method="ffill")
df8

Name Gender Age Salary


0 James male 30.0 20000.0
1 Kumar male 50.0 NaN
2 Laxman NaN NaN NaN
3 Rani female 25.0 NaN
4 Prem male NaN 50000.0
Out[42]: Name Gender Age Salary

0 James male 30.0 20000.0

1 Kumar male 50.0 20000.0

2 Laxman male 50.0 20000.0

3 Rani female 25.0 20000.0

4 Prem male 25.0 50000.0

In [43]:
print(dataset)
df9=dataset.fillna(method="bfill")
df9

Name Gender Age Salary


0 James male 30.0 20000.0
1 Kumar male 50.0 NaN
2 Laxman NaN NaN NaN
3 Rani female 25.0 NaN
4 Prem male NaN 50000.0
Out[43]: Name Gender Age Salary

0 James male 30.0 20000.0

1 Kumar male 50.0 50000.0

localhost:8888/nbconvert/html/Desktop/m l record/IRIS FLOWER DATASET(EDA).ipynb?download=false 17/21


12/9/22, 7:45 PM IRIS FLOWER DATASET(EDA)

Name Gender Age Salary

2 Laxman female 25.0 50000.0

3 Rani female 25.0 50000.0

4 Prem male NaN 50000.0

In [45]:
print(dataset)
df11=dataset
df11["Gender"]=dataset["Gender"].fillna("male")
df11

Name Gender Age Salary


0 James male 30.0 20000.0
1 Kumar male 50.0 NaN
2 Laxman NaN NaN NaN
3 Rani female 25.0 NaN
4 Prem male NaN 50000.0
Out[45]: Name Gender Age Salary

0 James male 30.0 20000.0

1 Kumar male 50.0 NaN

2 Laxman male NaN NaN

3 Rani female 25.0 NaN

4 Prem male NaN 50000.0

In [46]:
dataset.Gender.value_counts()

Out[46]: male 4
female 1
Name: Gender, dtype: int64

In [47]:
df11.Gender.value_counts()

Out[47]: male 4
female 1
Name: Gender, dtype: int64

In [48]:
dataset

Out[48]: Name Gender Age Salary

0 James male 30.0 20000.0

1 Kumar male 50.0 NaN

2 Laxman male NaN NaN

3 Rani female 25.0 NaN

4 Prem male NaN 50000.0

In [49]:
dataset1=dataset.fillna(dataset.mean())
dataset1

localhost:8888/nbconvert/html/Desktop/m l record/IRIS FLOWER DATASET(EDA).ipynb?download=false 18/21


12/9/22, 7:45 PM IRIS FLOWER DATASET(EDA)

Out[49]: Name Gender Age Salary

0 James male 30.0 20000.0

1 Kumar male 50.0 35000.0

2 Laxman male 35.0 35000.0

3 Rani female 25.0 35000.0

4 Prem male 35.0 50000.0

In [50]:
dataset2=dataset.fillna(dataset.mode())
dataset2

Out[50]: Name Gender Age Salary

0 James male 30.0 20000.0

1 Kumar male 50.0 50000.0

2 Laxman male 50.0 NaN

3 Rani female 25.0 NaN

4 Prem male NaN 50000.0

In [51]:
dataset3=dataset.fillna(dataset.median())
dataset3

Out[51]: Name Gender Age Salary

0 James male 30.0 20000.0

1 Kumar male 50.0 35000.0

2 Laxman male 30.0 35000.0

3 Rani female 25.0 35000.0

4 Prem male 30.0 50000.0

In [55]:
y=dataset.iloc[:,:].values
print(y)
from sklearn.impute import SimpleImputer
imputer=SimpleImputer(missing_values=np.nan,strategy='most_frequent')
imputer=imputer.fit(y)
y=imputer.transform(y)
y

[['James' 'male' 30.0 20000.0]


['Kumar' 'male' 50.0 nan]
['Laxman' 'male' nan nan]
['Rani' 'female' 25.0 nan]
['Prem ' 'male' nan 50000.0]]
Out[55]: array([['James', 'male', 30.0, 20000.0],
['Kumar', 'male', 50.0, 20000.0],
['Laxman', 'male', 25.0, 20000.0],
['Rani', 'female', 25.0, 20000.0],
['Prem ', 'male', 25.0, 50000.0]], dtype=object)

localhost:8888/nbconvert/html/Desktop/m l record/IRIS FLOWER DATASET(EDA).ipynb?download=false 19/21


12/9/22, 7:45 PM IRIS FLOWER DATASET(EDA)

In [56]:
z=dataset.iloc[:,2:4].values
print(z)
from sklearn.impute import SimpleImputer
imputer=SimpleImputer(missing_values=np.nan,strategy='mean')
imputer=imputer.fit(z)
z=imputer.transform(z)
z

[[3.0e+01 2.0e+04]
[5.0e+01 nan]
[ nan nan]
[2.5e+01 nan]
[ nan 5.0e+04]]
Out[56]: array([[3.0e+01, 2.0e+04],
[5.0e+01, 3.5e+04],
[3.5e+01, 3.5e+04],
[2.5e+01, 3.5e+04],
[3.5e+01, 5.0e+04]])

In [57]:
x=dataset.iloc[:, :].values
x

Out[57]: array([['James', 'male', 30.0, 20000.0],


['Kumar', 'male', 50.0, nan],
['Laxman', 'male', nan, nan],
['Rani', 'female', 25.0, nan],
['Prem ', 'male', nan, 50000.0]], dtype=object)

In [58]:
from sklearn.preprocessing import LabelEncoder
label_encoder_x=LabelEncoder()
x[:, 1]=label_encoder_x.fit_transform(x[:, 1])
x

Out[58]: array([['James', 1, 30.0, 20000.0],


['Kumar', 1, 50.0, nan],
['Laxman', 1, nan, nan],
['Rani', 0, 25.0, nan],
['Prem ', 1, nan, 50000.0]], dtype=object)

In [59]:
x=dataset.iloc[:, :].values
print(x)

[['James' 'male' 30.0 20000.0]


['Kumar' 'male' 50.0 nan]
['Laxman' 'male' nan nan]
['Rani' 'female' 25.0 nan]
['Prem ' 'male' nan 50000.0]]

In [61]:
from sklearn.impute import SimpleImputer
imputer=SimpleImputer(missing_values=np.nan,strategy='mean')
imputer.fit(x[:, 2:])
x[:, 2:]=imputer.transform(x[:, 2:])
print(x)

[['James' 'male' 30.0 20000.0]


['Kumar' 'male' 50.0 35000.0]
['Laxman' 'male' 35.0 35000.0]
['Rani' 'female' 25.0 35000.0]
['Prem ' 'male' 35.0 50000.0]]

In [62]:
x
localhost:8888/nbconvert/html/Desktop/m l record/IRIS FLOWER DATASET(EDA).ipynb?download=false 20/21
12/9/22, 7:45 PM IRIS FLOWER DATASET(EDA)

Out[62]: array([['James', 'male', 30.0, 20000.0],


['Kumar', 'male', 50.0, 35000.0],
['Laxman', 'male', 35.0, 35000.0],
['Rani', 'female', 25.0, 35000.0],
['Prem ', 'male', 35.0, 50000.0]], dtype=object)

In [63]:
from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
x[:, 2:]=sc.fit_transform(x[:, 2:])

In [64]:
x

Out[64]: array([['James', 'male', -0.5976143046671968, -1.5811388300841898],


['Kumar', 'male', 1.7928429140015905, 0.0],
['Laxman', 'male', 0.0, 0.0],
['Rani', 'female', -1.1952286093343936, 0.0],
['Prem ', 'male', 0.0, 1.5811388300841898]], dtype=object)

In [65]:
from sklearn.preprocessing import LabelEncoder
label_encoder_x=LabelEncoder()
x[:, 1]=label_encoder_x.fit_transform(x[:, 1])
x

Out[65]: array([['James', 1, -0.5976143046671968, -1.5811388300841898],


['Kumar', 1, 1.7928429140015905, 0.0],
['Laxman', 1, 0.0, 0.0],
['Rani', 0, -1.1952286093343936, 0.0],
['Prem ', 1, 0.0, 1.5811388300841898]], dtype=object)

localhost:8888/nbconvert/html/Desktop/m l record/IRIS FLOWER DATASET(EDA).ipynb?download=false 21/21


12/9/22, 7:47 PM exp2 (1)

EXPERIMENT-2
2.STATISTICAL OPERATIONS
STATISTICAL OPERATIONS USING PYTHON
Aim:-
by using python write a numpy program to compute the 1)Expected Value 2)Mean 3)Standard
Deviation 4)Variance 5)Covariance 6)Covariance Matrix Of Two Given Arrays

Description:-
Statistics is a pillar of machine learning. You cannot develop a deep understanding and
application of machine learning without it.

1)Expected Value:-An expected value gives a quick insight into the behavior of a random
variable without knowing if it is discrete or continuous.

2)Mean:-Mean is only the average of all numbers in a paticular numeric variable.

3)Standard Deviation:-It is simply square root of varience.

4)Variance:-It is the expected value of the squared variation of a random variable from its mean
value, in probability and statistics. Informally, variance estimates how far a set of numbers
(random) are spread out from their mean value.

5)Covariance:- It is a measure of the relationship between two random variables and to what
extent, they change together.

6)Covariance Matrix Of Two Given Arrays:-In NumPy for computing the covariance matrix of two
given arrays with help of numpy.cov(). In this, we will pass the two arrays and it will return the
covariance matrix of two given arrays.

Algorithm:-
Step1:-Importing the library files.

Step2:-Load the Dataset.

Step3:-Reading The Collected Dataset.

Step4:-Preprocessing The Data.

Step5:-Perform Statistical Operations i.e, 1)Expected Value. 2)Mean. 3)Standard Deviation.


4)Variance. 5)Covariance. 6)Covariance Matrix Of Two Given Arrays.

Step6:-Model Evaluation.

In [6]:
import statistics
list = [1, 2, 3,3,4,5,]

localhost:8888/nbconvert/html/Desktop/m l record/exp2 (1).ipynb?download=false 1/2


12/9/22, 7:47 PM exp2 (1)

print ("The mean values is : ",end="")


The mean values is : 3

In [7]:
import numpy as np
def expected_value(values, weights):
values = np.asarray(values)
weights = np.asarray(weights)
return (values * weights).sum() / weights.sum()
values = [0, 1, 2, 3, 4]
probs = [.18, .34, .35, .11, .02]
expected_value(values, probs)

Out[7]: 1.4500000000000002

In [8]:
import numpy as np
a = np.array([1, 2, 3, 4])
b=np.mean(a)
print("mean: ",b)
c=np.std(a)
print("std: ",c)
d=np.var(a)
print("variance: ",d)
e=np.cov(a)
print("co-variance: ",e)

mean: 2.5
std: 1.118033988749895
variance: 1.25
co-variance: 1.6666666666666665

In [9]:
array1 = np.array([3, 2])
array2 = np.array([7, 4])
# Covariance matrix
print("\nCovariance matrix :\n",
np.cov(array1, array2))

Covariance matrix :
[[0.5 1.5]
[1.5 4.5]]

localhost:8888/nbconvert/html/Desktop/m l record/exp2 (1).ipynb?download=false 2/2


12/9/22, 7:49 PM DATA PREPROCESSING

AIM : DATAEXPERIMENT-3
PREPROCESSING - continous /
discrete
DATA data
PREPROCESSING-CONTINUOUS /DISCRETE DATA

ALGORITHM :
-> Getting the data set. -> Importing libraries. -> Importing datasets. -> Finding Missing Data. -> Finding
Outliers -> Splitting dataset into training and testing -> Feature scaling.

DESCRIPTION :
Data preprocessing is a process of preparing the raw data and making it suitable for a machine
learning model. It is the first and crucial step while creating a machine learning model.

When creating a machine learning project, it is not always a case that we come across the clean
and formatted data. And while doing any operation with data, it is mandatory to clean it and put
in a formatted way. So for this, we use data preprocessing task.

PROGRAM:
1. IMPORT THE LIBRARY FILES

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

1. IMPORTING THE DATASET

In [2]:
dataset = pd.read_csv('C:\\Users\\STAFF\\Downloads\\Data1.csv')
dataset

Out[2]: Name Gender Age Salary

0 James male 30.0 20000.0

1 Kumar male 50.0 NaN

2 Laxman NaN NaN NaN

3 Rani female 25.0 NaN

4 Prem male NaN 50000.0

In [3]:
print(dataset)

Name Gender Age Salary


0 James male 30.0 20000.0
1 Kumar male 50.0 NaN
2 Laxman NaN NaN NaN
3 Rani female 25.0 NaN
4 Prem male NaN 50000.0

localhost:8888/nbconvert/html/Desktop/m l record/DATA PREPROCESSING.ipynb?download=false 1/9


12/9/22, 7:49 PM DATA PREPROCESSING

1. FINDING MISSING VALUES:

In [4]:
print(dataset)
dataset.isnull()

Name Gender Age Salary


0 James male 30.0 20000.0
1 Kumar male 50.0 NaN
2 Laxman NaN NaN NaN
3 Rani female 25.0 NaN
4 Prem male NaN 50000.0
Out[4]: Name Gender Age Salary

0 False False False False

1 False False False True

2 False True True True

3 False False False True

4 False False True False

In [5]:
dataset

Out[5]: Name Gender Age Salary

0 James male 30.0 20000.0

1 Kumar male 50.0 NaN

2 Laxman NaN NaN NaN

3 Rani female 25.0 NaN

4 Prem male NaN 50000.0

In [6]:
dataset[dataset.notnull()]

Out[6]: Name Gender Age Salary

0 James male 30.0 20000.0

1 Kumar male 50.0 NaN

2 Laxman NaN NaN NaN

3 Rani female 25.0 NaN

4 Prem male NaN 50000.0

In [7]:
dataset.isnull().sum()

Out[7]: Name 0
Gender 1
Age 2
Salary 3
dtype: int64

In [8]:
dataset.shape
localhost:8888/nbconvert/html/Desktop/m l record/DATA PREPROCESSING.ipynb?download=false 2/9
12/9/22, 7:49 PM DATA PREPROCESSING

Out[8]: (5, 4)

In [9]:
df1=dataset.dropna()
df1

Out[9]: Name Gender Age Salary

0 James male 30.0 20000.0

In [10]:
df1.shape

Out[10]: (1, 4)

In [11]:
df2=dataset.dropna(axis='columns')
df2

Out[11]: Name

0 James

1 Kumar

2 Laxman

3 Rani

4 Prem

In [12]:
df3=dataset.dropna(axis=1)
df3

Out[12]: Name

0 James

1 Kumar

2 Laxman

3 Rani

4 Prem

In [13]:
dataset

Out[13]: Name Gender Age Salary

0 James male 30.0 20000.0

1 Kumar male 50.0 NaN

2 Laxman NaN NaN NaN

3 Rani female 25.0 NaN

4 Prem male NaN 50000.0

localhost:8888/nbconvert/html/Desktop/m l record/DATA PREPROCESSING.ipynb?download=false 3/9


12/9/22, 7:49 PM DATA PREPROCESSING

In [14]:
df4=dataset.dropna(axis='columns',how='all')
df4

Out[14]: Name Gender Age Salary

0 James male 30.0 20000.0

1 Kumar male 50.0 NaN

2 Laxman NaN NaN NaN

3 Rani female 25.0 NaN

4 Prem male NaN 50000.0

In [15]:
print(dataset)
df5=dataset.dropna(axis='rows',thresh=3)
df5

Name Gender Age Salary


0 James male 30.0 20000.0
1 Kumar male 50.0 NaN
2 Laxman NaN NaN NaN
3 Rani female 25.0 NaN
4 Prem male NaN 50000.0
Out[15]: Name Gender Age Salary

0 James male 30.0 20000.0

1 Kumar male 50.0 NaN

3 Rani female 25.0 NaN

4 Prem male NaN 50000.0

In [16]:
dataset

Out[16]: Name Gender Age Salary

0 James male 30.0 20000.0

1 Kumar male 50.0 NaN

2 Laxman NaN NaN NaN

3 Rani female 25.0 NaN

4 Prem male NaN 50000.0

In [17]:
df6=dataset.fillna(0)
df6

Out[17]: Name Gender Age Salary

0 James male 30.0 20000.0

1 Kumar male 50.0 0.0

2 Laxman 0 0.0 0.0

3 Rani female 25.0 0.0

localhost:8888/nbconvert/html/Desktop/m l record/DATA PREPROCESSING.ipynb?download=false 4/9


12/9/22, 7:49 PM DATA PREPROCESSING

Name Gender Age Salary

4 Prem male 0.0 50000.0

In [18]:
df7=dataset.fillna(1)
df7

Out[18]: Name Gender Age Salary

0 James male 30.0 20000.0

1 Kumar male 50.0 1.0

2 Laxman 1 1.0 1.0

3 Rani female 25.0 1.0

4 Prem male 1.0 50000.0

In [19]:
print(dataset)
df8=dataset.fillna(method="ffill")
df8

Name Gender Age Salary


0 James male 30.0 20000.0
1 Kumar male 50.0 NaN
2 Laxman NaN NaN NaN
3 Rani female 25.0 NaN
4 Prem male NaN 50000.0
Out[19]: Name Gender Age Salary

0 James male 30.0 20000.0

1 Kumar male 50.0 20000.0

2 Laxman male 50.0 20000.0

3 Rani female 25.0 20000.0

4 Prem male 25.0 50000.0

In [20]:
print(dataset)
df9=dataset.fillna(method="bfill")
df9

localhost:8888/nbconvert/html/Desktop/m l record/DATA PREPROCESSING.ipynb?download=false 5/9


12/9/22, 7:49 PM DATA PREPROCESSING

Name Gender Age Salary


0 James male 30.0 20000.0
1 Kumar male 50.0 NaN
2 Laxman NaN NaN NaN
3 Rani female 25.0 NaN
4 Prem male NaN 50000.0
Out[20]: Name Gender Age Salary

0 James male 30.0 20000.0

1 Kumar male 50.0 50000.0

2 Laxman female 25.0 50000.0

3 Rani female 25.0 50000.0

4 Prem male NaN 50000.0

In [21]:
print(dataset)
df10=dataset.fillna(method="ffill")
df10

Name Gender Age Salary


0 James male 30.0 20000.0
1 Kumar male 50.0 NaN
2 Laxman NaN NaN NaN
3 Rani female 25.0 NaN
4 Prem male NaN 50000.0
Out[21]: Name Gender Age Salary

0 James male 30.0 20000.0

1 Kumar male 50.0 20000.0

2 Laxman male 50.0 20000.0

3 Rani female 25.0 20000.0

4 Prem male 25.0 50000.0

In [22]:
print(dataset)
df11=dataset
df11["Gender"]=dataset["Gender"].fillna("male")
df11

Name Gender Age Salary


0 James male 30.0 20000.0
1 Kumar male 50.0 NaN
2 Laxman NaN NaN NaN
3 Rani female 25.0 NaN
4 Prem male NaN 50000.0
Out[22]: Name Gender Age Salary

0 James male 30.0 20000.0

1 Kumar male 50.0 NaN

2 Laxman male NaN NaN

3 Rani female 25.0 NaN

4 Prem male NaN 50000.0

localhost:8888/nbconvert/html/Desktop/m l record/DATA PREPROCESSING.ipynb?download=false 6/9


12/9/22, 7:49 PM DATA PREPROCESSING

In [23]:
dataset.Gender.value_counts()

Out[23]: male 4
female 1
Name: Gender, dtype: int64

In [24]:
df11.Gender.value_counts()

Out[24]: male 4
female 1
Name: Gender, dtype: int64

In [25]:
dataset

Out[25]: Name Gender Age Salary

0 James male 30.0 20000.0

1 Kumar male 50.0 NaN

2 Laxman male NaN NaN

3 Rani female 25.0 NaN

4 Prem male NaN 50000.0

In [26]:
dataset1=dataset.fillna(dataset.mean())
dataset1

Out[26]: Name Gender Age Salary

0 James male 30.0 20000.0

1 Kumar male 50.0 35000.0

2 Laxman male 35.0 35000.0

3 Rani female 25.0 35000.0

4 Prem male 35.0 50000.0

In [28]:
dataset2=dataset.fillna(dataset.mode())
dataset2

Out[28]: Name Gender Age Salary

0 James male 30.0 20000.0

1 Kumar male 50.0 50000.0

2 Laxman male 50.0 NaN

3 Rani female 25.0 NaN

4 Prem male NaN 50000.0

In [29]:
dataset3=dataset.fillna(dataset.median())

localhost:8888/nbconvert/html/Desktop/m l record/DATA PREPROCESSING.ipynb?download=false 7/9


12/9/22, 7:49 PM DATA PREPROCESSING

In [30]:
dataset3

Out[30]: Name Gender Age Salary

0 James male 30.0 20000.0

1 Kumar male 50.0 35000.0

2 Laxman male 30.0 35000.0

3 Rani female 25.0 35000.0

4 Prem male 30.0 50000.0

In [31]:
Y=dataset.iloc[:,:].values
print(Y)
from sklearn.impute import SimpleImputer
imputer=SimpleImputer(missing_values=np.nan,strategy='most_frequent')
imputer=imputer.fit(Y)
Y=imputer.transform(Y)
Y

[['James' 'male' 30.0 20000.0]


['Kumar' 'male' 50.0 nan]
['Laxman' 'male' nan nan]
['Rani' 'female' 25.0 nan]
['Prem ' 'male' nan 50000.0]]
Out[31]: array([['James', 'male', 30.0, 20000.0],
['Kumar', 'male', 50.0, 20000.0],
['Laxman', 'male', 25.0, 20000.0],
['Rani', 'female', 25.0, 20000.0],
['Prem ', 'male', 25.0, 50000.0]], dtype=object)

In [32]:
dataset

Out[32]: Name Gender Age Salary

0 James male 30.0 20000.0

1 Kumar male 50.0 NaN

2 Laxman male NaN NaN

3 Rani female 25.0 NaN

4 Prem male NaN 50000.0

In [34]:
Z=dataset.iloc[:,2:4].values
print(Z)
from sklearn.impute import SimpleImputer
imputer=SimpleImputer(missing_values=np.nan,strategy='most_frequent')
imputer=imputer.fit(Z)
Z=imputer.transform(Z)
Z

[[3.0e+01 2.0e+04]
[5.0e+01 nan]
[ nan nan]
[2.5e+01 nan]
[ nan 5.0e+04]]

localhost:8888/nbconvert/html/Desktop/m l record/DATA PREPROCESSING.ipynb?download=false 8/9


12/9/22, 7:49 PM DATA PREPROCESSING

Out[34]: array([[3.0e+01, 2.0e+04],


[5.0e+01, 2.0e+04],
[2.5e+01, 2.0e+04],
[2.5e+01, 2.0e+04],
[2.5e+01, 5.0e+04]])

localhost:8888/nbconvert/html/Desktop/m l record/DATA PREPROCESSING.ipynb?download=false 9/9


12/9/22, 8:01 PM program4 (1)

EXPERIMENT-4
Aim : Data Preprocessing - Categorical Data
DATA PREPROCESSING - CATEGORICAL DATA
Algorithm:
For a given set of training data examples stored in a CSV file. Preprocessing in Machine learning
with the following steps

a. Getting the dataset.

b. Importing libraries.

c. Importing data sets.

d. Finding Missing Data.

e. Encoding Categorical Data.

f. Splitting data set into training and test set.

g. Feature scaling.

Description:
Data preprocessing is a process of preparing the raw data and making it suitable for a machine
learning model. It is the first and crucial step while creating a machine learning model.
Categorical data is data which has some categories such as, in our dataset; there are two
categorical variable, Country, and Purchased. Since machine learning model completely works
on mathematics and numbers, but if our dataset would have a categorical variable, then it may
create trouble while building the model. So it is necessary to encode these categorical variables
into numbers.

Program:
1) Get the Dataset
Dataset may be of different formats for different purposes, such as, if we want to create a
machine learning model for business purpose, then dataset will be different with the dataset
required for a liver patient. So each dataset is different from another dataset. To use the dataset
in our code, we usually put it into a CSV file. However, sometimes, we may also need to use an
HTML or xlsx file.

2) Importing Libraries
In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

localhost:8888/nbconvert/html/Desktop/m l record/program4 (1).ipynb?download=false 1/5


12/9/22, 8:01 PM program4 (1)

3) Importing the Datasets


In [2]:
dataset = pd.read_csv('C:\\Users\\STAFF\\Downloads\\Data.csv')
dataset

Out[2]: Country Age Salary Purchased

0 France 44.0 72000.0 No

1 Spain 27.0 48000.0 Yes

2 Germany 30.0 54000.0 No

3 Spain 38.0 61000.0 No

4 Germany 40.0 NaN Yes

5 France 35.0 58000.0 Yes

6 Spain NaN 52000.0 No

7 France 48.0 79000.0 Yes

8 Germany 50.0 83000.0 No

9 France 37.0 67000.0 Yes

In [3]:
X = dataset.iloc[:, :-1].values
X

Out[3]: array([['France', 44.0, 72000.0],


['Spain', 27.0, 48000.0],
['Germany', 30.0, 54000.0],
['Spain', 38.0, 61000.0],
['Germany', 40.0, nan],
['France', 35.0, 58000.0],
['Spain', nan, 52000.0],
['France', 48.0, 79000.0],
['Germany', 50.0, 83000.0],
['France', 37.0, 67000.0]], dtype=object)

In [4]:
print(X)

[['France' 44.0 72000.0]


['Spain' 27.0 48000.0]
['Germany' 30.0 54000.0]
['Spain' 38.0 61000.0]
['Germany' 40.0 nan]
['France' 35.0 58000.0]
['Spain' nan 52000.0]
['France' 48.0 79000.0]
['Germany' 50.0 83000.0]
['France' 37.0 67000.0]]

In [5]:
y= dataset.iloc[:,3].values

In [6]:
print(y)

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']

localhost:8888/nbconvert/html/Desktop/m l record/program4 (1).ipynb?download=false 2/5


12/9/22, 8:01 PM program4 (1)

4) Handling Missing data:


In [7]:
print(X)

[['France' 44.0 72000.0]


['Spain' 27.0 48000.0]
['Germany' 30.0 54000.0]
['Spain' 38.0 61000.0]
['Germany' 40.0 nan]
['France' 35.0 58000.0]
['Spain' nan 52000.0]
['France' 48.0 79000.0]
['Germany' 50.0 83000.0]
['France' 37.0 67000.0]]

In [8]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])

In [9]:
print(X)

[['France' 44.0 72000.0]


['Spain' 27.0 48000.0]
['Germany' 30.0 54000.0]
['Spain' 38.0 61000.0]
['Germany' 40.0 63777.77777777778]
['France' 35.0 58000.0]
['Spain' 38.77777777777778 52000.0]
['France' 48.0 79000.0]
['Germany' 50.0 83000.0]
['France' 37.0 67000.0]]

5) Encoding Categorical data:


In [10]:
#Catgorical data
#for Country Variable
from sklearn.preprocessing import LabelEncoder
label_encoder_x= LabelEncoder()
X[:, 0]= label_encoder_x.fit_transform(X[:, 0])

In [11]:
print(X)

[[0 44.0 72000.0]


[2 27.0 48000.0]
[1 30.0 54000.0]
[2 38.0 61000.0]
[1 40.0 63777.77777777778]
[0 35.0 58000.0]
[2 38.77777777777778 52000.0]
[0 48.0 79000.0]
[1 50.0 83000.0]
[0 37.0 67000.0]]

In [12]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

localhost:8888/nbconvert/html/Desktop/m l record/program4 (1).ipynb?download=false 3/5


12/9/22, 8:01 PM program4 (1)

ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='

In [13]:
X

Out[13]: array([[1.0, 0.0, 0.0, 44.0, 72000.0],


[0.0, 0.0, 1.0, 27.0, 48000.0],
[0.0, 1.0, 0.0, 30.0, 54000.0],
[0.0, 0.0, 1.0, 38.0, 61000.0],
[0.0, 1.0, 0.0, 40.0, 63777.77777777778],
[1.0, 0.0, 0.0, 35.0, 58000.0],
[0.0, 0.0, 1.0, 38.77777777777778, 52000.0],
[1.0, 0.0, 0.0, 48.0, 79000.0],
[0.0, 1.0, 0.0, 50.0, 83000.0],
[1.0, 0.0, 0.0, 37.0, 67000.0]], dtype=object)

In [14]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)

In [15]:
print(y)

[0 1 0 0 1 1 0 1 0 1]

6) Splitting the dataset into the Training set


and Test set
In [16]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_st

In [17]:
print(X_train)

[[0.0 0.0 1.0 38.77777777777778 52000.0]


[0.0 1.0 0.0 40.0 63777.77777777778]
[1.0 0.0 0.0 44.0 72000.0]
[0.0 0.0 1.0 38.0 61000.0]
[0.0 0.0 1.0 27.0 48000.0]
[1.0 0.0 0.0 48.0 79000.0]
[0.0 1.0 0.0 50.0 83000.0]
[1.0 0.0 0.0 35.0 58000.0]]

In [18]:
print(X_test)

[[0.0 1.0 0.0 30.0 54000.0]


[1.0 0.0 0.0 37.0 67000.0]]

In [19]:
print(y_train)

[0 1 0 0 1 1 0 1]

In [20]:
print(y_test)

[0 1]

7) Feature Scaling
localhost:8888/nbconvert/html/Desktop/m l record/program4 (1).ipynb?download=false 4/5
12/9/22, 8:01 PM program4 (1)

In [21]:
print(X)

[[1.0 0.0 0.0 44.0 72000.0]


[0.0 0.0 1.0 27.0 48000.0]
[0.0 1.0 0.0 30.0 54000.0]
[0.0 0.0 1.0 38.0 61000.0]
[0.0 1.0 0.0 40.0 63777.77777777778]
[1.0 0.0 0.0 35.0 58000.0]
[0.0 0.0 1.0 38.77777777777778 52000.0]
[1.0 0.0 0.0 48.0 79000.0]
[0.0 1.0 0.0 50.0 83000.0]
[1.0 0.0 0.0 37.0 67000.0]]

In [22]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])
X_test[:, 3:] = sc.transform(X_test[:, 3:])

In [23]:
X_train

Out[23]: array([[0.0, 0.0, 1.0, -0.19159184384578545, -1.0781259408412425],


[0.0, 1.0, 0.0, -0.014117293757057777, -0.07013167641635372],
[1.0, 0.0, 0.0, 0.566708506533324, 0.633562432710455],
[0.0, 0.0, 1.0, -0.30453019390224867, -0.30786617274297867],
[0.0, 0.0, 1.0, -1.9018011447007988, -1.420463615551582],
[1.0, 0.0, 0.0, 1.1475343068237058, 1.232653363453549],
[0.0, 1.0, 0.0, 1.4379472069688968, 1.5749910381638885],
[1.0, 0.0, 0.0, -0.7401495441200351, -0.5646194287757332]],
dtype=object)

In [24]:
X_test

Out[24]: array([[0.0, 1.0, 0.0, -1.4661817944830124, -0.9069571034860727],


[1.0, 0.0, 0.0, -0.44973664397484414, 0.2056403393225306]],
dtype=object)

localhost:8888/nbconvert/html/Desktop/m l record/program4 (1).ipynb?download=false 5/5


12/9/22, 7:52 PM Decision Tree

EXPERIMENT-5
5.Decision Tree
DECISION TREE
Aim:-
Write a program to demonstrate the working of the decision tree based ID3 algorithm.Use an
appropriate data set for building the decision tree and apply this knowledge to classify an new
sample

Description:
A decision tree is a support tool with a tree-like structure that
models probable outcomes, cost of
resources,utilities, and possible consequences. Decision trees
provide a way to present algorithms with conditional
control statements. They include branches that represent decision-
making steps that can lead to a favorable result.

Decision trees typically consist of three different elements: Root Node: The top-level node
represents the ultimate objective or big decision you’re trying to make. Branches: Branches,
which stem from the root, represent different options—or courses of action—that are available
when making a particular decision. They are most commonly indicated with an arrow line and
often include associated costs, as well as the likelihood to occur. Leaf Node: The leaf nodes—
which are attached at the end of the branches—represent possible outcomes for each action.
There are typically two types of leaf nodes: square leaf nodes, which indicate another decision to
be made, and circle leaf nodes, which indicate a chance event or unknown outcome.

Algorithm:
Steps:

1. Importing the library files

2. Reading the Iris Dataset

3. Preprocessing

4. Split the dataset into training and testing

5. Build the model (Decision Tree Model)

6. Evaluate the performance of the Model

7. Visualize the model

Program:
1.Importing the Library Files
localhost:8888/nbconvert/html/Desktop/m l record/Decision Tree.ipynb?download=false 1/9
12/9/22, 7:52 PM Decision Tree

In [5]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.model_selection import train_test_split

2.Reading the Iris Dataset


In [7]:
iris=pd.read_csv(r"C:\Users\rakes\Downloads\iris.csv")
iris

Out[7]: sepal_length sepal_width petal_length petal_width species

0 5.1 3.5 1.4 0.2 setosa

1 4.9 3.0 1.4 0.2 setosa

2 4.7 3.2 1.3 0.2 setosa

3 4.6 3.1 1.5 0.2 setosa

4 5.0 3.6 1.4 0.2 setosa

... ... ... ... ... ...

145 6.7 3.0 5.2 2.3 virginica

146 6.3 2.5 5.0 1.9 virginica

147 6.5 3.0 5.2 2.0 virginica

148 6.2 3.4 5.4 2.3 virginica

149 5.9 3.0 5.1 1.8 virginica

150 rows × 5 columns

In [8]:
print(iris.shape)
print(iris.head())

(150, 5)
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa

In [9]:
print(iris.species)

0 setosa
1 setosa
2 setosa
3 setosa
4 setosa
...
145 virginica

localhost:8888/nbconvert/html/Desktop/m l record/Decision Tree.ipynb?download=false 2/9


12/9/22, 7:52 PM Decision Tree

146 virginica
147 virginica
148 virginica
149 virginica
Name: species, Length: 150, dtype: object

In [10]:
iris.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 sepal_length 150 non-null float64
1 sepal_width 150 non-null float64
2 petal_length 150 non-null float64
3 petal_width 150 non-null float64
4 species 150 non-null object
dtypes: float64(4), object(1)
memory usage: 6.0+ KB

3.Preprocessing
In [12]:
from sklearn import preprocessing
le=preprocessing.LabelEncoder()
le.fit(iris.species)
iris['species']=le.transform(iris.species)
print(iris)
print(iris.species)

sepal_length sepal_width petal_length petal_width species


0 5.1 3.5 1.4 0.2 0
1 4.9 3.0 1.4 0.2 0
2 4.7 3.2 1.3 0.2 0
3 4.6 3.1 1.5 0.2 0
4 5.0 3.6 1.4 0.2 0
.. ... ... ... ... ...
145 6.7 3.0 5.2 2.3 2
146 6.3 2.5 5.0 1.9 2
147 6.5 3.0 5.2 2.0 2
148 6.2 3.4 5.4 2.3 2
149 5.9 3.0 5.1 1.8 2

[150 rows x 5 columns]


0 0
1 0
2 0
3 0
4 0
..
145 2
146 2
147 2
148 2
149 2
Name: species, Length: 150, dtype: int64

In [17]:
x=iris.iloc[:,0:4]
x

Out[17]: sepal_length sepal_width petal_length petal_width

0 5.1 3.5 1.4 0.2

localhost:8888/nbconvert/html/Desktop/m l record/Decision Tree.ipynb?download=false 3/9


12/9/22, 7:52 PM Decision Tree

sepal_length sepal_width petal_length petal_width

1 4.9 3.0 1.4 0.2

2 4.7 3.2 1.3 0.2

3 4.6 3.1 1.5 0.2

4 5.0 3.6 1.4 0.2

... ... ... ... ...

145 6.7 3.0 5.2 2.3

146 6.3 2.5 5.0 1.9

147 6.5 3.0 5.2 2.0

148 6.2 3.4 5.4 2.3

149 5.9 3.0 5.1 1.8

150 rows × 4 columns

In [18]:
y=iris.iloc[:,4]
y

Out[18]: 0 0
1 0
2 0
3 0
4 0
..
145 2
146 2
147 2
148 2
149 2
Name: species, Length: 150, dtype: int64

4.Split the dataset into training and testing


In [19]:
x_train,x_test,y_train,y_test=train_test_split(x,y,train_size=0.70,random_state=101)

In [20]:
print(x_test.shape)
print(x_train.shape)
print(y_test.shape)
print(y_train.shape)

(45, 4)
(105, 4)
(45,)
(105,)

5.Build the Model(Decision Tree Model)


In [22]:
clf_train=DecisionTreeClassifier(random_state=1234)
dt_train=clf_train.fit(x_train,y_train)

localhost:8888/nbconvert/html/Desktop/m l record/Decision Tree.ipynb?download=false 4/9


12/9/22, 7:52 PM Decision Tree

In [25]:
y_pred=dt_train.predict(x_test)
y_pred[:10]

Out[25]: array([0, 0, 0, 1, 1, 2, 1, 1, 2, 0], dtype=int64)

In [26]:
print(y_test[0:11])

33 0
16 0
43 0
129 2
50 1
123 2
68 1
53 1
146 2
1 0
147 2
Name: species, dtype: int64

6.Evaluate the Performance of the Model


In [27]:
import sklearn.metrics as metrics
confusion=metrics.confusion_matrix(y_true=y_test,y_pred=y_pred)
confusion

Out[27]: array([[13, 0, 0],


[ 0, 19, 1],
[ 0, 1, 11]], dtype=int64)

Accuracy of Test
In [30]:
print(dt_train.score(x_train,y_train))
print(dt_train.score(x_test,y_test))

1.0
0.9555555555555556

In [31]:
class_wise=metrics.classification_report(y_true=y_test,y_pred=y_pred)
print(class_wise)

precision recall f1-score support

0 1.00 1.00 1.00 13


1 0.95 0.95 0.95 20
2 0.92 0.92 0.92 12

accuracy 0.96 45
macro avg 0.96 0.96 0.96 45
weighted avg 0.96 0.96 0.96 45

7.Visualize The Model

localhost:8888/nbconvert/html/Desktop/m l record/Decision Tree.ipynb?download=false 5/9


12/9/22, 7:52 PM Decision Tree

7.1 Visualize The Decision Tree onn Training


Data
In [32]:
text_representation=tree.export_text(clf_train)
print(text_representation)

|--- feature_2 <= 2.45


| |--- class: 0
|--- feature_2 > 2.45
| |--- feature_3 <= 1.65
| | |--- feature_2 <= 4.95
| | | |--- class: 1
| | |--- feature_2 > 4.95
| | | |--- feature_3 <= 1.55
| | | | |--- class: 2
| | | |--- feature_3 > 1.55
| | | | |--- class: 1
| |--- feature_3 > 1.65
| | |--- feature_2 <= 4.85
| | | |--- feature_1 <= 3.10
| | | | |--- class: 2
| | | |--- feature_1 > 3.10
| | | | |--- class: 1
| | |--- feature_2 > 4.85
| | | |--- class: 2

In [33]:
with open("decision_tree_train.log","w") as fout:
fout.write(text_representation)

In [34]:
fn=['sepal length (cm)','sepal width (cm)','petal length (cm)','petal width (cm)']
cn=['setosa','versicolor','virginica']

In [35]:
fig=plt.figure(figsize=(25,20))
tree.plot_tree(clf_train,feature_names=fn,class_names=cn,filled=True)
fig.savefig('imagename.png')

localhost:8888/nbconvert/html/Desktop/m l record/Decision Tree.ipynb?download=false 6/9


12/9/22, 7:52 PM Decision Tree

In [36]:
fig.savefig("decision_tree_train.png")

7.2 Visualize The Decision Tree on Testing


Data
In [37]:
clf_test=DecisionTreeClassifier(random_state=1234)
dt_test=clf_test.fit(x_test,y_test)

In [38]:
text_representation=tree.export_text(clf_test)
print(text_representation)

|--- feature_2 <= 2.45


| |--- class: 0
|--- feature_2 > 2.45
| |--- feature_2 <= 4.80
| | |--- class: 1
| |--- feature_2 > 4.80
| | |--- feature_3 <= 1.75
| | | |--- feature_2 <= 5.30
| | | | |--- class: 1
| | | |--- feature_2 > 5.30
| | | | |--- class: 2
| | |--- feature_3 > 1.75
| | | |--- class: 2

In [39]:
with open("decision_tree_test.log","w") as fout:
fout.write(text_representation)

localhost:8888/nbconvert/html/Desktop/m l record/Decision Tree.ipynb?download=false 7/9


12/9/22, 7:52 PM Decision Tree

In [41]:
fig=plt.figure(figsize=(25,20))
tree.plot_tree(clf_test,feature_names=fn,class_names=cn,filled=True)
fig.savefig('imagename1.png')

In [42]:
fig.savefig("decision_tree_test.png")

7.3 Visualize The Decision Tree on Overall


Data
In [45]:
clf=DecisionTreeClassifier(random_state=1234)
dt=clf.fit(x,y)

In [46]:
text_representation=tree.export_text(clf)
print(text_representation)

|--- feature_2 <= 2.45


| |--- class: 0
|--- feature_2 > 2.45
| |--- feature_3 <= 1.75
| | |--- feature_2 <= 4.95
| | | |--- feature_3 <= 1.65
| | | | |--- class: 1
| | | |--- feature_3 > 1.65
| | | | |--- class: 2
| | |--- feature_2 > 4.95
| | | |--- feature_3 <= 1.55
| | | | |--- class: 2
localhost:8888/nbconvert/html/Desktop/m l record/Decision Tree.ipynb?download=false 8/9
12/9/22, 7:52 PM Decision Tree

| | | |--- feature_3 > 1.55


| | | | |--- feature_0 <= 6.95
| | | | | |--- class: 1
| | | | |--- feature_0 > 6.95
| | | | | |--- class: 2
| |--- feature_3 > 1.75
| | |--- feature_2 <= 4.85
| | | |--- feature_1 <= 3.10
| | | | |--- class: 2
| | | |--- feature_1 > 3.10
| | | | |--- class: 1
| | |--- feature_2 > 4.85
| | | |--- class: 2

In [49]:
with open("decision_tree.log","w") as fout:
fout.write(text_representation)

In [48]:
fig=plt.figure(figsize=(25,20))
tree.plot_tree(clf,feature_names=fn,class_names=cn,filled=True)
fig.savefig('imagename2.png')

localhost:8888/nbconvert/html/Desktop/m l record/Decision Tree.ipynb?download=false 9/9


12/9/22, 7:53 PM Linear Regression (1)

LINEAR REGRESSION
EXPERIMENT-6
LINEAR REGRESSION
Build a linear regression model using python for a particular data set by a. Splitting Training data
and Test data. b. Evaluate the model(intercept and slope). c. Visualize the training set and testing
set d. Predicting the test set result e. Compare actual output values with predicted values.

Algorithm:
To Build a linear regression model using python for a particular
data set by
a. Splitting Training data and Test data.
b. Evaluate the model(intercept and slope.
c. Visualize the training set and testing set
d. Predicting the test set result.
e. Compare actual output values with predicted values.

Description:
Linear regression analysis is used to predict the value of a variable based on the value of another
variable. The variable you want to predict is called the dependent variable. The variable you are
using to predict the other variable's value is called the independent variable.

This form of analysis estimates the coefficients of the linear equation, involving one or more
independent variables that best predict the value of the dependent variable. Linear regression
fits a straight line or surface that minimizes the discrepancies between predicted and actual
output values. There are simple linear regression calculators that use a “least squares” method to
discover the best-fit line for a set of paired data. You then estimate the value of X (dependent
variable) from Y (independent variable).

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
data=pd.read_csv("weight-height1.csv")
data.head()

Out[3]: Gender Height Weight

0 Male 73.847017 241.893563

1 Male 68.781904 162.310473

2 Male 74.110105 212.740856

3 Male 71.730978 220.042470

4 Male 69.881796 206.349801

In [4]:
data.isnull().sum()

localhost:8888/nbconvert/html/Desktop/m l record/Linear Regression (1).ipynb?download=false 1/6


12/9/22, 7:53 PM Linear Regression (1)

Out[4]: Gender 0
Height 0
Weight 0
dtype: int64

In [5]:
data.shape

Out[5]: (10000, 3)

In [10]:
x1 = data.iloc[:, 0].values
y1 = data.iloc[:, 2].values
plt.scatter(x1,y1,label='Gender',color='Green',s=50)
plt.xlabel('Gender')
plt.ylabel('Weight')
plt.title('Gender vs Weight')
plt.legend()
plt.show()

In [8]:
x1 = data.iloc[:, 0].values
y1 = data.iloc[:, 1].values
plt.scatter(x1,y1,label='Gender',color='Green',s=50)
plt.xlabel('Gender')
plt.ylabel('Hight')
plt.title('Gender vs Hight')
plt.legend()
plt.show()

localhost:8888/nbconvert/html/Desktop/m l record/Linear Regression (1).ipynb?download=false 2/6


12/9/22, 7:53 PM Linear Regression (1)

In [27]:
x2 = data.iloc[:, 1].values
y2 = data.iloc[:, 2].values
plt.scatter(x2,y2,label='Height',color='blue',s=10)
plt.xlabel('Height')
plt.ylabel('Weight')
plt.title('Height vs Weight')
plt.legend(loc="lower right")
plt.show()

In [22]:
X = data.iloc[:,1:2].values
print(X)

[[73.84701702]
[68.78190405]
[74.11010539]
...
[63.86799221]
[69.03424313]
[61.94424588]]

In [14]:
y = data.iloc[:, 2].values
print(y)

[241.89356318 162.31047252 212.74085556 ... 128.47531878 163.85246135


113.64910268]

localhost:8888/nbconvert/html/Desktop/m l record/Linear Regression (1).ipynb?download=false 3/6


12/9/22, 7:53 PM Linear Regression (1)

In [16]:
Heightmin=X.min()
Heightmax=X.max()
Heightnorm=(X-Heightmin)/(Heightmax-Heightmin)
Weightmin=y.min()
Weightmax=y.max()
Weightnorm=(y-Weightmin)/(Weightmax-Weightmin)
print(Weightnorm)
print(Heightnorm)

[0.863139 0.4754764 0.72113127 ... 0.31065968 0.48298768 0.23843869]


[[0.79172838]
[0.58695829]
[0.8023644 ]
...
[0.38830089]
[0.59715974]
[0.31052854]]

In [23]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_stat
print(X_train)
print(X_test)
print(y_train)
print(y_test)

[[60.1085895 ]
[56.81031728]
[66.20296361]
...
[63.31981767]
[68.99744011]
[66.44514693]]
[[73.18176748]
[71.43337582]
[60.02674978]
...
[65.62500563]
[61.53011039]
[65.89275253]]
[ 98.08851735 84.17069477 164.63446367 ... 119.31598826 200.31301563
163.29338674]
[208.8391625 216.63399969 103.38694607 ... 188.31858763 106.3043173
158.1849857 ]

In [28]:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
print(y_pred)

[214.02231091 200.50905719 112.34767127 ... 155.61638129 123.96708917


157.68578703]

In [36]:
plt.scatter(X_train, y_train, color = "yellow")
plt.plot(X_train, regressor.predict(X_train), color = 'black')
plt.title('Hight vs Weights (Training set)')
plt.xlabel('Hight')
plt.ylabel('Weight')
plt.show()

localhost:8888/nbconvert/html/Desktop/m l record/Linear Regression (1).ipynb?download=false 4/6


12/9/22, 7:53 PM Linear Regression (1)

In [37]:
plt.scatter(X_test, y_test, color = 'pink')
plt.plot(X_train, regressor.predict(X_train), color = 'blue')
plt.title('Hight vs weights (Test set)')
plt.xlabel('Height')
plt.ylabel('Weight')

Out[37]: Text(0, 0.5, 'Weight')

In [38]:
y_pred = regressor.predict(X_test)
print('Coefficients: ', regressor.coef_)
print("Mean squared error: %.2f" % np.mean((regressor.predict(X_test) - y_test) ** 2
print('Variance score: %.2f' % regressor.score(X_test, y_test))

Coefficients: [7.72896259]
Mean squared error: 143.23
Variance score: 0.86

In [43]:
knownvalue=int(input("Enter the value of hight:"))
findvalue=regressor.predict([[knownvalue]])
print("when the height value is",knownvalue,"that moment weight value is",findvalue)

Enter the value of hight:62


when the height value is 62 that moment weight value is [127.5988484]

localhost:8888/nbconvert/html/Desktop/m l record/Linear Regression (1).ipynb?download=false 5/6


12/9/22, 7:53 PM Linear Regression (1)

In [45]:
data["predicted_value"]=regressor.predict(X)
data.head()

Out[45]: Gender Height Weight predicted_value

0 Male 73.847017 241.893563 219.164000

1 Male 68.781904 162.310473 180.015931

2 Male 74.110105 212.740856 221.197400

3 Male 71.730978 220.042470 202.809216

4 Male 69.881796 206.349801 188.516954

localhost:8888/nbconvert/html/Desktop/m l record/Linear Regression (1).ipynb?download=false 6/6


12/9/22, 7:56 PM multicollinearity (1)

EXPERIMENT-7
MULTIPLE LINEAR REGRESSION
MULTILINEAR REGRESSION
DESCRIPTION
The variable should have a robust relationship with independent variables. However, any
unbiased variables shouldn’t have robust correlations among other independent variables.
Collinearity can be a linear affiliation among explanatory variables. Two variables are perfectly
collinear if there’s a particular linear relationship between them. Multicollinearity refers to a
situation at some stage in which two or greater explanatory variables in the course of a multiple
correlation model are pretty linearly related. We’ve perfect multicollinearity if the correlation
between impartial variables is good to 1 or -1. In practice, we do not often face ideal
multicollinearity for the duration of an information set. More commonly, the difficulty of
multicollinearity arises when there’s an approximately linear courting between two or more
unbiased variables.

Importing Libraries
In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

Read the Datasets


In [2]:
df=pd.read_csv(r"C:\Users\STAFF\Desktop\multilinear\housing.csv")

Check the Datasets


In [3]:
df.head()

Out[3]: 0.00632 18.00 2.310 0 0.5380 6.5750 65.20 4.0900 1 296.0 15.30 396.90 4.98 24.00

0 0.02731 0.00 7.070 0 0.4690 6.4210 78...

1 0.02729 0.00 7.070 0 0.4690 7.1850 61...

2 0.03237 0.00 2.180 0 0.4580 6.9980 45...

3 0.06905 0.00 2.180 0 0.4580 7.1470 54...

4 0.02985 0.00 2.180 0 0.4580 6.4300 58...

In [4]:
from sklearn.datasets import load_boston
boston = load_boston()

localhost:8888/nbconvert/html/Desktop/m l record/multicollinearity (1).ipynb?download=false 1/12


12/9/22, 7:56 PM multicollinearity (1)

In [5]:
print(boston.data.shape)

(506, 13)

In [6]:
print(boston.feature_names)

['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
'B' 'LSTAT']

In [7]:
print(boston.target)

[24. 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 15. 18.9 21.7 20.4
18.2 19.9 23.1 17.5 20.2 18.2 13.6 19.6 15.2 14.5 15.6 13.9 16.6 14.8
18.4 21. 12.7 14.5 13.2 13.1 13.5 18.9 20. 21. 24.7 30.8 34.9 26.6
25.3 24.7 21.2 19.3 20. 16.6 14.4 19.4 19.7 20.5 25. 23.4 18.9 35.4
24.7 31.6 23.3 19.6 18.7 16. 22.2 25. 33. 23.5 19.4 22. 17.4 20.9
24.2 21.7 22.8 23.4 24.1 21.4 20. 20.8 21.2 20.3 28. 23.9 24.8 22.9
23.9 26.6 22.5 22.2 23.6 28.7 22.6 22. 22.9 25. 20.6 28.4 21.4 38.7
43.8 33.2 27.5 26.5 18.6 19.3 20.1 19.5 19.5 20.4 19.8 19.4 21.7 22.8
18.8 18.7 18.5 18.3 21.2 19.2 20.4 19.3 22. 20.3 20.5 17.3 18.8 21.4
15.7 16.2 18. 14.3 19.2 19.6 23. 18.4 15.6 18.1 17.4 17.1 13.3 17.8
14. 14.4 13.4 15.6 11.8 13.8 15.6 14.6 17.8 15.4 21.5 19.6 15.3 19.4
17. 15.6 13.1 41.3 24.3 23.3 27. 50. 50. 50. 22.7 25. 50. 23.8
23.8 22.3 17.4 19.1 23.1 23.6 22.6 29.4 23.2 24.6 29.9 37.2 39.8 36.2
37.9 32.5 26.4 29.6 50. 32. 29.8 34.9 37. 30.5 36.4 31.1 29.1 50.
33.3 30.3 34.6 34.9 32.9 24.1 42.3 48.5 50. 22.6 24.4 22.5 24.4 20.
21.7 19.3 22.4 28.1 23.7 25. 23.3 28.7 21.5 23. 26.7 21.7 27.5 30.1
44.8 50. 37.6 31.6 46.7 31.5 24.3 31.7 41.7 48.3 29. 24. 25.1 31.5
23.7 23.3 22. 20.1 22.2 23.7 17.6 18.5 24.3 20.5 24.5 26.2 24.4 24.8
29.6 42.8 21.9 20.9 44. 50. 36. 30.1 33.8 43.1 48.8 31. 36.5 22.8
30.7 50. 43.5 20.7 21.1 25.2 24.4 35.2 32.4 32. 33.2 33.1 29.1 35.1
45.4 35.4 46. 50. 32.2 22. 20.1 23.2 22.3 24.8 28.5 37.3 27.9 23.9
21.7 28.6 27.1 20.3 22.5 29. 24.8 22. 26.4 33.1 36.1 28.4 33.4 28.2
22.8 20.3 16.1 22.1 19.4 21.6 23.8 16.2 17.8 19.8 23.1 21. 23.8 23.1
20.4 18.5 25. 24.6 23. 22.2 19.3 22.6 19.8 17.1 19.4 22.2 20.7 21.1
19.5 18.5 20.6 19. 18.7 32.7 16.5 23.9 31.2 17.5 17.2 23.1 24.5 26.6
22.9 24.1 18.6 30.1 18.2 20.6 17.8 21.7 22.7 22.6 25. 19.9 20.8 16.8
21.9 27.5 21.9 23.1 50. 50. 50. 50. 50. 13.8 13.8 15. 13.9 13.3
13.1 10.2 10.4 10.9 11.3 12.3 8.8 7.2 10.5 7.4 10.2 11.5 15.1 23.2
9.7 13.8 12.7 13.1 12.5 8.5 5. 6.3 5.6 7.2 12.1 8.3 8.5 5.
11.9 27.9 17.2 27.5 15. 17.2 17.9 16.3 7. 7.2 7.5 10.4 8.8 8.4
16.7 14.2 20.8 13.4 11.7 8.3 10.2 10.9 11. 9.5 14.5 14.1 16.1 14.3
11.7 13.4 9.6 8.7 8.4 12.8 10.5 17.1 18.4 15.4 10.8 11.8 14.9 12.6
14.1 13. 13.4 15.2 16.1 17.8 14.9 14.1 12.7 13.5 14.9 20. 16.4 17.7
19.5 20.2 21.4 19.9 19. 19.1 19.1 20.1 19.9 19.6 23.2 29.8 13.8 13.3
16.7 12. 14.6 21.4 23. 23.7 25. 21.8 20.6 21.2 19.1 20.6 15.2 7.
8.1 13.6 20.1 21.8 24.5 23.1 19.7 18.3 21.2 17.5 16.8 22.4 20.6 23.9
22. 11.9]

In [8]:
print(boston.DESCR)

.. _boston_dataset:

Boston house prices dataset


---------------------------

**Data Set Characteristics:**

:Number of Instances: 506

:Number of Attributes: 13 numeric/categorical predictive. Median Value (attribut


e 14) is usually the target.

:Attribute Information (in order):


- CRIM per capita crime rate by town
localhost:8888/nbconvert/html/Desktop/m l record/multicollinearity (1).ipynb?download=false 2/12
12/9/22, 7:56 PM multicollinearity (1)

- ZN proportion of residential land zoned for lots over 25,000 sq.ft.


- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherw
ise)
- NOX nitric oxides concentration (parts per 10 million)
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centres
- RAD index of accessibility to radial highways
- TAX full-value property-tax rate per $10,000
- PTRATIO pupil-teacher ratio by town
- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT % lower status of the population
- MEDV Median value of owner-occupied homes in $1000's

:Missing Attribute Values: None

:Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.


https://archive.ics.uci.edu/ml/machine-learning-databases/housing/

This dataset was taken from the StatLib library which is maintained at Carnegie Mell
on University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic


prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980. N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that addre
ss regression
problems.

.. topic:: References

- Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data an


d Sources of Collinearity', Wiley, 1980. 244-261.
- Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Procee
dings on the Tenth International Conference of Machine Learning, 236-243, University
of Massachusetts, Amherst. Morgan Kaufmann.

In [9]:
import pandas as pd
bos = pd.DataFrame(boston.data)
print(bos.head())

0 1 2 3 4 5 6 7 8 9 10 \
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7

11 12
0 396.90 4.98
1 396.90 9.14
2 392.83 4.03
3 394.63 2.94
4 396.90 5.33

In [10]:
bos['PRICE'] = boston.target

bos.head()

localhost:8888/nbconvert/html/Desktop/m l record/multicollinearity (1).ipynb?download=false 3/12


12/9/22, 7:56 PM multicollinearity (1)

Out[10]: 0 1 2 3 4 5 6 7 8 9 10 11 12 PRICE

0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 24.0

1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 21.6

2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 34.7

3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94 33.4

4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 36.2

In [ ]:
X = bos.drop('PRICE', axis = 1)
X

In [12]:
Y = bos['PRICE']
Y

Out[12]: 0 24.0
1 21.6
2 34.7
3 33.4
4 36.2
...
501 22.4
502 20.6
503 23.9
504 22.0
505 11.9
Name: PRICE, Length: 506, dtype: float64

Testing And Training


In [13]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.30, random_s
print(X_train.shape)

print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)

(354, 13)
(152, 13)
(354,)
(152,)

In [14]:
X_train

Out[14]: 0 1 2 3 4 5 6 7 8 9 10 11 12

445 10.67180 0.0 18.10 0.0 0.740 6.459 94.8 1.9879 24.0 666.0 20.2 43.06 23.98

428 7.36711 0.0 18.10 0.0 0.679 6.193 78.1 1.9356 24.0 666.0 20.2 96.73 21.52

481 5.70818 0.0 18.10 0.0 0.532 6.750 74.9 3.3317 24.0 666.0 20.2 393.07 7.74

55 0.01311 90.0 1.22 0.0 0.403 7.249 21.9 8.6966 5.0 226.0 17.9 395.93 4.81

488 0.15086 0.0 27.74 0.0 0.609 5.454 92.7 1.8209 4.0 711.0 20.1 395.09 18.06

... ... ... ... ... ... ... ... ... ... ... ... ... ...

localhost:8888/nbconvert/html/Desktop/m l record/multicollinearity (1).ipynb?download=false 4/12


12/9/22, 7:56 PM multicollinearity (1)

0 1 2 3 4 5 6 7 8 9 10 11 12

486 5.69175 0.0 18.10 0.0 0.583 6.114 79.8 3.5459 24.0 666.0 20.2 392.68 14.98

189 0.08370 45.0 3.44 0.0 0.437 7.185 38.9 4.5667 5.0 398.0 15.2 396.90 5.39

495 0.17899 0.0 9.69 0.0 0.585 5.670 28.8 2.7986 6.0 391.0 19.2 393.29 17.60

206 0.22969 0.0 10.59 0.0 0.489 6.326 52.5 4.3549 4.0 277.0 18.6 394.87 10.97

355 0.10659 80.0 1.91 0.0 0.413 5.936 19.5 10.5857 4.0 334.0 22.0 376.04 5.57

354 rows × 13 columns

In [15]:
# code source:https://medium.com/@haydar_ai/learning-data-science-day-9-linear-regre
from sklearn.linear_model import LinearRegression

lm = LinearRegression()
lm.fit(X_train, Y_train)

Y_pred = lm.predict(X_test)

In [16]:
from sklearn.metrics import r2_score
r2=r2_score(Y_test, Y_pred)
print("r2_score value is :",r2)

r2_score value is : 0.6771696999851688

In [17]:
predicted= lm.predict([[0.10659,80.0,1.91,0.0,0.413,5.936,19.5,10.5857,4.0,334.0,22.
print (predicted)

[17.01702816]

In [18]:
bos1=bos

In [19]:
bos1

Out[19]: 0 1 2 3 4 5 6 7 8 9 10 11 12 PRICE

0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 24.0

1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 21.6

2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 34.7

3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94 33.4

4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 36.2

... ... ... ... ... ... ... ... ... ... ... ... ... ... ...

501 0.06263 0.0 11.93 0.0 0.573 6.593 69.1 2.4786 1.0 273.0 21.0 391.99 9.67 22.4

502 0.04527 0.0 11.93 0.0 0.573 6.120 76.7 2.2875 1.0 273.0 21.0 396.90 9.08 20.6

503 0.06076 0.0 11.93 0.0 0.573 6.976 91.0 2.1675 1.0 273.0 21.0 396.90 5.64 23.9

504 0.10959 0.0 11.93 0.0 0.573 6.794 89.3 2.3889 1.0 273.0 21.0 393.45 6.48 22.0

505 0.04741 0.0 11.93 0.0 0.573 6.030 80.8 2.5050 1.0 273.0 21.0 396.90 7.88 11.9

506 rows × 14 columns


localhost:8888/nbconvert/html/Desktop/m l record/multicollinearity (1).ipynb?download=false 5/12
12/9/22, 7:56 PM multicollinearity (1)

In [20]:
bos1.head()

Out[20]: 0 1 2 3 4 5 6 7 8 9 10 11 12 PRICE

0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 24.0

1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 21.6

2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 34.7

3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94 33.4

4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 36.2

In [21]:
bos1.shape

Out[21]: (506, 14)

In [22]:
bos1.dtypes

Out[22]: 0 float64
1 float64
2 float64
3 float64
4 float64
5 float64
6 float64
7 float64
8 float64
9 float64
10 float64
11 float64
12 float64
PRICE float64
dtype: object

In [23]:
num=['float64']
num_vars=list(bos1.select_dtypes(include=num))

In [24]:
num_vars

Out[24]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 'PRICE']

In [25]:
bos1=bos1[num_vars]

In [26]:
bos1.shape

Out[26]: (506, 14)

In [27]:
bos1.isna().sum()

Out[27]: 0 0
1 0
2 0
localhost:8888/nbconvert/html/Desktop/m l record/multicollinearity (1).ipynb?download=false 6/12
12/9/22, 7:56 PM multicollinearity (1)

3 0
4 0
5 0
6 0
7 0
8 0
9 0
10 0
11 0
12 0
PRICE 0
dtype: int64

In [28]:
X = bos1.iloc[:,0:13]
X.shape
y=bos1.iloc[:,-1]
y.shape

Out[28]: (506,)

In [29]:
X.head()

Out[29]: 0 1 2 3 4 5 6 7 8 9 10 11 12

0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98

1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14

2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03

3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94

4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33

In [30]:
y.head()

Out[30]: 0 24.0
1 21.6
2 34.7
3 33.4
4 36.2
Name: PRICE, dtype: float64

In [31]:
from sklearn.model_selection import train_test_split

In [32]:
X_train, X_test, y_train, y_test=train_test_split(X,y,test_size=0.3,random_state=123

In [33]:
X_train

Out[33]: 0 1 2 3 4 5 6 7 8 9 10 11 12

273 0.22188 20.0 6.96 1.0 0.464 7.691 51.8 4.3665 3.0 223.0 18.6 390.77 6.58

52 0.05360 21.0 5.64 0.0 0.439 6.511 21.1 6.8147 4.0 243.0 16.8 396.90 5.28

181 0.06888 0.0 2.46 0.0 0.488 6.144 62.2 2.5979 3.0 193.0 17.8 396.90 9.45

452 5.09017 0.0 18.10 0.0 0.713 6.297 91.8 2.3682 24.0 666.0 20.2 385.09 17.27

localhost:8888/nbconvert/html/Desktop/m l record/multicollinearity (1).ipynb?download=false 7/12


12/9/22, 7:56 PM multicollinearity (1)

0 1 2 3 4 5 6 7 8 9 10 11 12

381 15.87440 0.0 18.10 0.0 0.671 6.545 99.1 1.5192 24.0 666.0 20.2 396.90 21.08

... ... ... ... ... ... ... ... ... ... ... ... ... ...

98 0.08187 0.0 2.89 0.0 0.445 7.820 36.9 3.4952 2.0 276.0 18.0 393.53 3.57

476 4.87141 0.0 18.10 0.0 0.614 6.484 93.6 2.3053 24.0 666.0 20.2 396.21 18.68

322 0.35114 0.0 7.38 0.0 0.493 6.041 49.9 4.7211 5.0 287.0 19.6 396.90 7.70

382 9.18702 0.0 18.10 0.0 0.700 5.536 100.0 1.5804 24.0 666.0 20.2 396.90 23.60

365 4.55587 0.0 18.10 0.0 0.718 3.561 87.9 1.6132 24.0 666.0 20.2 354.70 7.12

354 rows × 13 columns

In [34]:
X_test

Out[34]: 0 1 2 3 4 5 6 7 8 9 10 11 12

410 51.13580 0.0 18.10 0.0 0.5970 5.757 100.0 1.4130 24.0 666.0 20.2 2.60 10.11

85 0.05735 0.0 4.49 0.0 0.4490 6.630 56.1 4.4377 3.0 247.0 18.5 392.30 6.53

280 0.03578 20.0 3.33 0.0 0.4429 7.820 64.5 4.6947 5.0 216.0 14.9 387.31 3.76

422 12.04820 0.0 18.10 0.0 0.6140 5.648 87.6 1.9512 24.0 666.0 20.2 291.55 14.10

199 0.03150 95.0 1.47 0.0 0.4030 6.975 15.3 7.6534 3.0 402.0 17.0 396.90 4.56

... ... ... ... ... ... ... ... ... ... ... ... ... ...

310 2.63548 0.0 9.90 0.0 0.5440 4.973 37.8 2.5194 4.0 304.0 18.4 350.45 12.64

91 0.03932 0.0 3.41 0.0 0.4890 6.405 73.9 3.0921 2.0 270.0 17.8 393.55 8.20

151 1.49632 0.0 19.58 0.0 0.8710 5.404 100.0 1.5916 5.0 403.0 14.7 341.60 13.28

426 12.24720 0.0 18.10 0.0 0.5840 5.837 59.7 1.9976 24.0 666.0 20.2 24.65 15.69

472 3.56868 0.0 18.10 0.0 0.5800 6.437 75.0 2.8965 24.0 666.0 20.2 393.37 14.36

152 rows × 13 columns

In [35]:
corrmatrix = X_train.corr()

In [36]:
corrmatrix

Out[36]: 0 1 2 3 4 5 6 7 8

0 1.000000 -0.185823 0.384390 -0.057423 0.415677 -0.222242 0.337658 -0.353118 0.605329

1 -0.185823 1.000000 -0.522038 -0.015666 -0.511122 0.295831 -0.566624 0.665621 -0.300111

2 0.384390 -0.522038 1.000000 0.018161 0.758886 -0.415828 0.654440 -0.688434 0.579931

3 -0.057423 -0.015666 0.018161 1.000000 0.074893 0.080803 0.071978 -0.061796 -0.013473

4 0.415677 -0.511122 0.758886 0.074893 1.000000 -0.325523 0.737702 -0.764562 0.638051

5 -0.222242 0.295831 -0.415828 0.080803 -0.325523 1.000000 -0.235201 0.201811 -0.235527

localhost:8888/nbconvert/html/Desktop/m l record/multicollinearity (1).ipynb?download=false 8/12


12/9/22, 7:56 PM multicollinearity (1)

0 1 2 3 4 5 6 7 8

6 0.337658 -0.566624 0.654440 0.071978 0.737702 -0.235201 1.000000 -0.749613 0.464805

7 -0.353118 0.665621 -0.688434 -0.061796 -0.764562 0.201811 -0.749613 1.000000 -0.475315

8 0.605329 -0.300111 0.579931 -0.013473 0.638051 -0.235527 0.464805 -0.475315 1.000000

9 0.555264 -0.333887 0.740166 -0.055190 0.698958 -0.336795 0.531105 -0.534683 0.893770

10 0.272980 -0.373930 0.396049 -0.108745 0.217334 -0.375801 0.274428 -0.230396 0.448710

11 -0.334038 0.165907 -0.354460 0.061620 -0.382578 0.140743 -0.268073 0.274031 -0.461296

12 0.467253 -0.425021 0.644862 -0.040215 0.624147 -0.615942 0.611358 -0.521627 0.507385

In [37]:
sns.heatmap(corrmatrix)

Out[37]: <AxesSubplot:>

Correlation between different features


In [38]:
def correlation(bos1, threshold):
correlated_cols=set()
corr_matrix=bos1.corr()
for i in range(len(corr_matrix.columns)):
for j in range(i):
if abs(corr_matrix.iloc[i,j]) > threshold:
colname = corr_matrix.columns[i]
correlated_cols.add(colname)
return correlated_cols

In [39]:
correlation(X_train, 0.7)

Out[39]: {4, 6, 7, 9}

In [40]:
correlation(X_train, 0.6)

Out[40]: {4, 6, 7, 8, 9, 12}

localhost:8888/nbconvert/html/Desktop/m l record/multicollinearity (1).ipynb?download=false 9/12


12/9/22, 7:56 PM multicollinearity (1)

In [41]:
correlation(X_train, 0.9)

Out[41]: set()

In [42]:
correlation(X_train, 0.8)

Out[42]: {9}

In [43]:
correlation(X_train, 0.6)

Out[43]: {4, 6, 7, 8, 9, 12}

In [44]:
correlation(X_train, 0.7)

Out[44]: {4, 6, 7, 9}

In [45]:
corr_feature=correlation(X_train, 0.7)

In [46]:
corr_feature

Out[46]: {4, 6, 7, 9}

In [47]:
X_train.shape, X_test.shape

Out[47]: ((354, 13), (152, 13))

In [48]:
X_test

Out[48]: 0 1 2 3 4 5 6 7 8 9 10 11 12

410 51.13580 0.0 18.10 0.0 0.5970 5.757 100.0 1.4130 24.0 666.0 20.2 2.60 10.11

85 0.05735 0.0 4.49 0.0 0.4490 6.630 56.1 4.4377 3.0 247.0 18.5 392.30 6.53

280 0.03578 20.0 3.33 0.0 0.4429 7.820 64.5 4.6947 5.0 216.0 14.9 387.31 3.76

422 12.04820 0.0 18.10 0.0 0.6140 5.648 87.6 1.9512 24.0 666.0 20.2 291.55 14.10

199 0.03150 95.0 1.47 0.0 0.4030 6.975 15.3 7.6534 3.0 402.0 17.0 396.90 4.56

... ... ... ... ... ... ... ... ... ... ... ... ... ...

310 2.63548 0.0 9.90 0.0 0.5440 4.973 37.8 2.5194 4.0 304.0 18.4 350.45 12.64

91 0.03932 0.0 3.41 0.0 0.4890 6.405 73.9 3.0921 2.0 270.0 17.8 393.55 8.20

151 1.49632 0.0 19.58 0.0 0.8710 5.404 100.0 1.5916 5.0 403.0 14.7 341.60 13.28

426 12.24720 0.0 18.10 0.0 0.5840 5.837 59.7 1.9976 24.0 666.0 20.2 24.65 15.69

472 3.56868 0.0 18.10 0.0 0.5800 6.437 75.0 2.8965 24.0 666.0 20.2 393.37 14.36

152 rows × 13 columns

localhost:8888/nbconvert/html/Desktop/m l record/multicollinearity (1).ipynb?download=false 10/12


12/9/22, 7:56 PM multicollinearity (1)

In [49]:
X_train.drop(labels=corr_feature,axis=1, inplace=True)
X_test.drop(labels=corr_feature,axis=1, inplace=True)

C:\Users\STAFF\anaconda3\lib\site-packages\pandas\core\frame.py:4308: SettingWithCop
yWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/u


ser_guide/indexing.html#returning-a-view-versus-a-copy
return super().drop(

In [50]:
X_train.shape, X_test.shape

Out[50]: ((354, 9), (152, 9))

In [51]:
X_train

Out[51]: 0 1 2 3 5 8 10 11 12

273 0.22188 20.0 6.96 1.0 7.691 3.0 18.6 390.77 6.58

52 0.05360 21.0 5.64 0.0 6.511 4.0 16.8 396.90 5.28

181 0.06888 0.0 2.46 0.0 6.144 3.0 17.8 396.90 9.45

452 5.09017 0.0 18.10 0.0 6.297 24.0 20.2 385.09 17.27

381 15.87440 0.0 18.10 0.0 6.545 24.0 20.2 396.90 21.08

... ... ... ... ... ... ... ... ... ...

98 0.08187 0.0 2.89 0.0 7.820 2.0 18.0 393.53 3.57

476 4.87141 0.0 18.10 0.0 6.484 24.0 20.2 396.21 18.68

322 0.35114 0.0 7.38 0.0 6.041 5.0 19.6 396.90 7.70

382 9.18702 0.0 18.10 0.0 5.536 24.0 20.2 396.90 23.60

365 4.55587 0.0 18.10 0.0 3.561 24.0 20.2 354.70 7.12

354 rows × 9 columns

In [52]:
X_test

Out[52]: 0 1 2 3 5 8 10 11 12

410 51.13580 0.0 18.10 0.0 5.757 24.0 20.2 2.60 10.11

85 0.05735 0.0 4.49 0.0 6.630 3.0 18.5 392.30 6.53

280 0.03578 20.0 3.33 0.0 7.820 5.0 14.9 387.31 3.76

422 12.04820 0.0 18.10 0.0 5.648 24.0 20.2 291.55 14.10

199 0.03150 95.0 1.47 0.0 6.975 3.0 17.0 396.90 4.56

... ... ... ... ... ... ... ... ... ...

310 2.63548 0.0 9.90 0.0 4.973 4.0 18.4 350.45 12.64

91 0.03932 0.0 3.41 0.0 6.405 2.0 17.8 393.55 8.20

localhost:8888/nbconvert/html/Desktop/m l record/multicollinearity (1).ipynb?download=false 11/12


12/9/22, 7:56 PM multicollinearity (1)

0 1 2 3 5 8 10 11 12

151 1.49632 0.0 19.58 0.0 5.404 5.0 14.7 341.60 13.28

426 12.24720 0.0 18.10 0.0 5.837 24.0 20.2 24.65 15.69

472 3.56868 0.0 18.10 0.0 6.437 24.0 20.2 393.37 14.36

152 rows × 9 columns

In [53]:
from sklearn.linear_model import LinearRegression

lm = LinearRegression()
lm.fit(X_train, Y_train)

Y_pred = lm.predict(X_test)

In [54]:
from sklearn.metrics import r2_score
r3=r2_score(Y_test, Y_pred)
print("r2_score value is :",r3)

r2_score value is : -0.004212527987457415

In [55]:
corr_feature

Out[55]: {4, 6, 7, 9}

In [56]:
predicted= lm.predict([[0.10659,80.0,1.91,0.0,5.936,4.0,22.0,376.04,5.57]])
print (predicted)

[22.85590786]

localhost:8888/nbconvert/html/Desktop/m l record/multicollinearity (1).ipynb?download=false 12/12


12/9/22, 7:57 PM
EXPERIMENT-8 Logistic regression

LOGISTIC REGRESSION
Aim:
To Write a python program for Implement Logistic Regression

Algorithm:
Logistic regression is one of the most popular Machine Learning algorithms, which comes under
the Supervised Learning technique. It is used for predicting the categorical dependent variable
using a given set of independent variables.

Logistic regression predicts the output of a categorical dependent variable. Therefore the
outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1, true or False,
etc. but instead of giving the exact value as 0 and 1, it gives the probabilistic values which lie
between 0 and 1.

Logistic Regression is much similar to the Linear Regression except that how they are used.
Linear Regression is used for solving Regression problems, whereas Logistic regression is used
for solving the classification problems.

In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic function,
which predicts two maximum values (0 or 1).

Logistic Regression is a significant machine learning algorithm because it has the ability to
provide probabilities and classify new data using continuous and discrete datasets.

Logistic Regression can be used to classify the observations using different types of data and
can easily determine the most effective variables used for the classification. The below image is
showing the logistic function:

Logistic Regression Equation:

We know the equation of the straight line can be written as:

localhost:8888/nbconvert/html/Desktop/m l record/Logistic regression.ipynb?download=false 1/5


12/9/22, 7:57 PM Logistic regression

In Logistic Regression y can be between 0 and 1 only, so for this let's divide the above equation
by (1-y):

But we need range between -[infinity] to +[infinity], then take logarithm of the equation it will
become:

Description
Steps to implementing the Machine Learning algorithm

1. Identify the Problem


2. Import Libraries
3. load the dataset
4. Preprocess the data
5. Feature scale the data
6. Training and Testing the data
7. Build the Model
8. Performance of the model

Program
Import Libraries
In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.model_selection import train_test_split

Read the dataset


In [2]:
db=pd.read_csv('diabetes.csv')

In [3]:
print(db.shape)
print(db.head())

(768, 9)
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI \
0 6 148 72 35 0 33.6
localhost:8888/nbconvert/html/Desktop/m l record/Logistic regression.ipynb?download=false 2/5
12/9/22, 7:57 PM Logistic regression

1 1 85 66 29 0 26.6
2 8 183 64 0 0 23.3
3 1 89 66 23 94 28.1
4 0 137 40 35 168 43.1

DiabetesPedigreeFunction Age Outcome


0 0.627 50 1
1 0.351 31 0
2 0.672 32 1
3 0.167 21 0
4 2.288 33 1

In [4]:
x=db.iloc[:,0:8]

In [5]:
x

Out[5]: Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction A

0 6 148 72 35 0 33.6 0.627

1 1 85 66 29 0 26.6 0.351

2 8 183 64 0 0 23.3 0.672

3 1 89 66 23 94 28.1 0.167

4 0 137 40 35 168 43.1 2.288

... ... ... ... ... ... ... ...

763 10 101 76 48 180 32.9 0.171

764 2 122 70 27 0 36.8 0.340

765 5 121 72 23 112 26.2 0.245

766 1 126 60 0 0 30.1 0.349

767 1 93 70 31 0 30.4 0.315

768 rows × 8 columns

In [6]:
y=db.iloc[:,-1]

In [7]:
y

Out[7]: 0 1
1 0
2 1
3 0
4 1
..
763 0
764 0
765 0
766 1
767 0
Name: Outcome, Length: 768, dtype: int64

Preprocess the data


localhost:8888/nbconvert/html/Desktop/m l record/Logistic regression.ipynb?download=false 3/5
12/9/22, 7:57 PM Logistic regression

In [8]:
from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
x=sc.fit_transform(x)
x

Out[8]: array([[ 0.63994726, 0.84832379, 0.14964075, ..., 0.20401277,


0.46849198, 1.4259954 ],
[-0.84488505, -1.12339636, -0.16054575, ..., -0.68442195,
-0.36506078, -0.19067191],
[ 1.23388019, 1.94372388, -0.26394125, ..., -1.10325546,
0.60439732, -0.10558415],
...,
[ 0.3429808 , 0.00330087, 0.14964075, ..., -0.73518964,
-0.68519336, -0.27575966],
[-0.84488505, 0.1597866 , -0.47073225, ..., -0.24020459,
-0.37110101, 1.17073215],
[-0.84488505, -0.8730192 , 0.04624525, ..., -0.20212881,
-0.47378505, -0.87137393]])

Spliting dataset for training and testing


In [9]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.25,random_state=16)

Build the Model


In [10]:
from sklearn.linear_model import LogisticRegression
lr=LogisticRegression(random_state=16)
lr.fit(x_train,y_train)
pred=lr.predict(x_test)

In [13]:
import sklearn.metrics as metrics
cnf=metrics.confusion_matrix(y_true=y_test,y_pred=pred)
cnf

Out[13]: array([[116, 9],


[ 26, 41]], dtype=int64)

In [23]:
class_names=[0,1]
fig, ax=plt.subplots()
tick_marks=np.arange(len(class_names))
plt.xticks(tick_marks,class_names)
plt.yticks(tick_marks,class_names)

sns.heatmap(pd.DataFrame(cnf),annot=True,cmap="YlGnBu",fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix',y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

Out[23]: Text(0.5, 257.44, 'Predicted label')

localhost:8888/nbconvert/html/Desktop/m l record/Logistic regression.ipynb?download=false 4/5


12/9/22, 7:57 PM Logistic regression

classify the model


In [26]:
from sklearn.metrics import classification_report
target_names=['without diabetes','with diabetes']
print(classification_report(y_test,pred,target_names=target_names))

precision recall f1-score support

without diabetes 0.82 0.93 0.87 125


with diabetes 0.82 0.61 0.70 67

accuracy 0.82 192


macro avg 0.82 0.77 0.78 192
weighted avg 0.82 0.82 0.81 192

In [22]:
y_pred_proba=lr.predict_proba(x_test)[::,1]
fpr,tpr,_=metrics.roc_curve(y_test,y_pred_proba)
auc=metrics.roc_auc_score(y_test,y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()

localhost:8888/nbconvert/html/Desktop/m l record/Logistic regression.ipynb?download=false 5/5


12/18/22, 4:32 PM
EXPERIMENT-9 Untitled5

CLUSTERING
In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN

In [2]:
df = pd.read_csv('Mall_Customers.csv')
df

Out[2]: CustomerID Gender Age Annual Income (k$) Spending Score (1-100)

0 1 Male 19 15 39

1 2 Male 21 15 81

2 3 Female 20 16 6

3 4 Female 23 16 77

4 5 Female 31 17 40

... ... ... ... ... ...

195 196 Female 35 120 79

196 197 Female 45 126 28

197 198 Male 32 126 74

198 199 Male 32 137 18

199 200 Male 30 137 83

200 rows × 5 columns

In [4]:
X_train = df[['Age', 'Annual Income (k$)', 'Spending Score (1-100)']]

In [5]:
clustering = DBSCAN(eps=12.5, min_samples=4) .fit(X_train)
DBSCAN_dataset = X_train.copy()
DBSCAN_dataset.loc[:,'Cluster'] = clustering. labels_

In [6]:
DBSCAN_dataset.Cluster.value_counts().to_frame()

Out[6]: Cluster

0 112

2 34

3 24

-1 18

1 8

4 4

localhost:8888/nbconvert/html/Desktop/m l record/Untitled5.ipynb?download=false 1/2


12/18/22, 4:32 PM Untitled5

In [ ]:
outliers = DBSCAN_dataset[DBSCAN_dataset['Cluster']==-1]
fig2, (axes) = plt.subplots(1,2,figsize=(12,5))
sns.scatterplot( 'Annual Income (k$)', 'Spending Score (1-100)',
data-DBSCAN_dataset[DBSCAN_dataset[ 'Cluster']!=-1],
hue='Cluster', ax=axes[0], palette='Set2', legend='full', s=200)

sns.scatterplot('Age', 'Spending Score (1-100)',


data=DBSCAN_dataset[DBSCAN_dataset['Cluster']!=-1],
hue='Cluster', palette="Set2", ax=axes[1], legend='full', s=200)
axes[0].scatter(outliers[ 'Annual Income (k$)'], outliers['Spending Score (1-100)'],
axes[1].scatter(outliers['Age'], outliers['Spending Score (1-100)'], s=10, label='ou
axes[0].legend()
axes[1].legend()
plt.setp(axes[0].get_legend().get_texts(), fontsize='12')
plt.setp(axes[1].get_legend().get_texts(), fontsize='12')
plt.show()

localhost:8888/nbconvert/html/Desktop/m l record/Untitled5.ipynb?download=false 2/2


12/9/22, 7:58 PM
EXPERIMENT-10 A KNN (1)

AIM:- KNN
10.A) Write a Python program to implement KNN and SVM

description:
K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on Supervised
Learning technique. K-NN algorithm assumes the similarity between the new case/data and
available cases and put the new case into the category that is most similar to the available
categories. K-NN algorithm stores all the available data and classifies a new data point based on
the similarity. This means when new data appears then it can be easily classified into a well suite
category by using K- NN algorithm. K-NN algorithm can be used for Regression as well as for
Classification but mostly it is used for the Classification problems. K-NN is a non-parametric
algorithm, which means it does not make any assumption on underlying data. It is also called a
lazy learner algorithm because it does not learn from the training set immediately instead it
stores the dataset and at the time of classification, it performs an action on the dataset.

algorithm
The K-NN working can be explained on the basis of the below algorithm:

Step-1: Select the number K of the neighbors

Step-2: Calculate the Euclidean distance of K number of neighbors

Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.

Step-4: Among these k neighbors, count the number of the data points in each category.

Step-5: Assign the new data points to that category for which the number of the neighbor is
maximum.

Step-6: Our model is ready.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import linear_model
from sklearn.model_selection import train_test_split
iris = pd.read_csv("iris (4).csv")
print(iris.shape)
print(iris.head())

(150, 5)
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
localhost:8888/nbconvert/html/Desktop/m l record/KNN (1).ipynb?download=false 1/4
12/9/22, 7:58 PM KNN (1)

In [2]:
iris

Out[2]: sepal_length sepal_width petal_length petal_width species

0 5.1 3.5 1.4 0.2 setosa

1 4.9 3.0 1.4 0.2 setosa

2 4.7 3.2 1.3 0.2 setosa

3 4.6 3.1 1.5 0.2 setosa

4 5.0 3.6 1.4 0.2 setosa

... ... ... ... ... ...

145 6.7 3.0 5.2 2.3 virginica

146 6.3 2.5 5.0 1.9 virginica

147 6.5 3.0 5.2 2.0 virginica

148 6.2 3.4 5.4 2.3 virginica

149 5.9 3.0 5.1 1.8 virginica

150 rows × 5 columns

In [3]:
print(iris.species)

0 setosa
1 setosa
2 setosa
3 setosa
4 setosa
...
145 virginica
146 virginica
147 virginica
148 virginica
149 virginica
Name: species, Length: 150, dtype: object

In [4]:
from sklearn import preprocessing
le= preprocessing.LabelEncoder()
le.fit(iris.species)
iris['species']=le.transform(iris.species)
print(iris)
print(iris.species)

sepal_length sepal_width petal_length petal_width species


0 5.1 3.5 1.4 0.2 0
1 4.9 3.0 1.4 0.2 0
2 4.7 3.2 1.3 0.2 0
3 4.6 3.1 1.5 0.2 0
4 5.0 3.6 1.4 0.2 0
.. ... ... ... ... ...
145 6.7 3.0 5.2 2.3 2
146 6.3 2.5 5.0 1.9 2
147 6.5 3.0 5.2 2.0 2
148 6.2 3.4 5.4 2.3 2
149 5.9 3.0 5.1 1.8 2

[150 rows x 5 columns]


0 0
localhost:8888/nbconvert/html/Desktop/m l record/KNN (1).ipynb?download=false 2/4
12/9/22, 7:58 PM KNN (1)

1 0
2 0
3 0
4 0
..
145 2
146 2
147 2
148 2
149 2
Name: species, Length: 150, dtype: int32

In [8]:
X=iris.iloc[:,0:4]
y=iris.iloc[:,4]
y

Out[8]: 0 0
1 0
2 0
3 0
4 0
..
145 2
146 2
147 2
148 2
149 2
Name: species, Length: 150, dtype: int32

In [10]:
from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
sc=sc.fit(X)
X=sc.transform(X)

In [15]:
x_train,x_test,y_train,y_test=train_test_split(X,y,train_size=0.70,random_state=101)

In [20]:
from sklearn.neighbors import KNeighborsClassifier
ne=KNeighborsClassifier(n_neighbors=3)
ne.fit(x_train,y_train)

Out[20]: ▾ KNeighborsClassifier
KNeighborsClassifier(n_neighbors=3)

In [21]:
y_pred=ne.predict(x_test)
y_pred[:10]

Out[21]: array([0, 0, 0, 2, 1, 2, 1, 1, 2, 0])

In [22]:
import sklearn.metrics as metrics
confusion =metrics.confusion_matrix(y_true=y_test,y_pred=y_pred)
confusion

Out[22]: array([[13, 0, 0],


[ 0, 20, 0],
[ 0, 1, 11]], dtype=int64)

localhost:8888/nbconvert/html/Desktop/m l record/KNN (1).ipynb?download=false 3/4


12/9/22, 7:58 PM KNN (1)

In [23]:
metrics.accuracy_score(y_true=y_test,y_pred=y_pred)

Out[23]: 0.9777777777777777

In [25]:
class_wise= metrics.classification_report(y_true=y_test,y_pred=y_pred)
print(class_wise)

precision recall f1-score support

0 1.00 1.00 1.00 13


1 0.95 1.00 0.98 20
2 1.00 0.92 0.96 12

accuracy 0.98 45
macro avg 0.98 0.97 0.98 45
weighted avg 0.98 0.98 0.98 45

localhost:8888/nbconvert/html/Desktop/m l record/KNN (1).ipynb?download=false 4/4


EXPERIMENT-10 B
12/9/22, 7:59 PM SVM (2)

SVM
DESCRIPTION:
Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms,
which is used for Classification as well as Regression problems. However, primarily, it is used for
Classification problems in Machine Learning.

The goal of the SVM algorithm is to create the best line or decision boundary that can segregate
n-dimensional space into classes so that we can easily put the new data point in the correct
category in the future. This best decision boundary is called a hyperplane.

SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme
cases are called as support vectors, and hence algorithm is termed as Support Vector Machine.
Consider the below diagram in which there are two different categories that are classified using
a decision boundary or hyperplane:

ALGORITHM:-
Linear SVM:

The working of the SVM algorithm can be understood by using an example. Suppose we have a
dataset that has two tags (green and blue), and the dataset has two features x1 and x2. We want
a classifier that can classify the pair(x1, x2) of coordinates in either green or blue. Consider the
below image:

So as it is 2-d space so by just using a straight line, we can easily separate these two classes. But
there can be multiple lines that can separate these classes. Consider the below image:

Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary
or region is called as a hyperplane. SVM algorithm finds the closest point of the lines from both
the classes. These points are called support vectors. The distance between the vectors and the
hyperplane is called as margin. And the goal of SVM is to maximize this margin. The hyperplane
with maximum margin is called the optimal hyperplane.

Non-Linear SVM:

If data is linearly arranged, then we can separate it by using a straight line, but for non-linear
data, we cannot draw a single straight line. Consider the below image:

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import linear_model
from sklearn.model_selection import train_test_split

localhost:8888/nbconvert/html/Desktop/m l record/SVM (2).ipynb?download=false 1/6


12/9/22, 7:59 PM SVM (2)

In [3]:
iris=pd.read_csv("iris.csv")

In [4]:
print(iris.shape)
print(iris.head())

(150, 5)
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa

In [5]:
iris

Out[5]: sepal_length sepal_width petal_length petal_width species

0 5.1 3.5 1.4 0.2 setosa

1 4.9 3.0 1.4 0.2 setosa

2 4.7 3.2 1.3 0.2 setosa

3 4.6 3.1 1.5 0.2 setosa

4 5.0 3.6 1.4 0.2 setosa

... ... ... ... ... ...

145 6.7 3.0 5.2 2.3 virginica

146 6.3 2.5 5.0 1.9 virginica

147 6.5 3.0 5.2 2.0 virginica

148 6.2 3.4 5.4 2.3 virginica

149 5.9 3.0 5.1 1.8 virginica

150 rows × 5 columns

In [9]:
print(iris.species)

0 setosa
1 setosa
2 setosa
3 setosa
4 setosa
...
145 virginica
146 virginica
147 virginica
148 virginica
149 virginica
Name: species, Length: 150, dtype: object

In [10]:
iris.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
# Column Non-Null Count Dtype

localhost:8888/nbconvert/html/Desktop/m l record/SVM (2).ipynb?download=false 2/6


12/9/22, 7:59 PM SVM (2)

--- ------ -------------- -----


0 sepal_length 150 non-null float64
1 sepal_width 150 non-null float64
2 petal_length 150 non-null float64
3 petal_width 150 non-null float64
4 species 150 non-null object
dtypes: float64(4), object(1)
memory usage: 6.0+ KB

In [11]:
from sklearn import preprocessing
le=preprocessing.LabelEncoder()
le.fit(iris.species)
iris['species']=le.transform(iris.species)
print(iris)
print(iris.species)

sepal_length sepal_width petal_length petal_width species


0 5.1 3.5 1.4 0.2 0
1 4.9 3.0 1.4 0.2 0
2 4.7 3.2 1.3 0.2 0
3 4.6 3.1 1.5 0.2 0
4 5.0 3.6 1.4 0.2 0
.. ... ... ... ... ...
145 6.7 3.0 5.2 2.3 2
146 6.3 2.5 5.0 1.9 2
147 6.5 3.0 5.2 2.0 2
148 6.2 3.4 5.4 2.3 2
149 5.9 3.0 5.1 1.8 2

[150 rows x 5 columns]


0 0
1 0
2 0
3 0
4 0
..
145 2
146 2
147 2
148 2
149 2
Name: species, Length: 150, dtype: int32

In [12]:
iris.describe()

Out[12]: sepal_length sepal_width petal_length petal_width species

count 150.000000 150.000000 150.000000 150.000000 150.000000

mean 5.843333 3.054000 3.758667 1.198667 1.000000

std 0.828066 0.433594 1.764420 0.763161 0.819232

min 4.300000 2.000000 1.000000 0.100000 0.000000

25% 5.100000 2.800000 1.600000 0.300000 0.000000

50% 5.800000 3.000000 4.350000 1.300000 1.000000

75% 6.400000 3.300000 5.100000 1.800000 2.000000

max 7.900000 4.400000 6.900000 2.500000 2.000000

localhost:8888/nbconvert/html/Desktop/m l record/SVM (2).ipynb?download=false 3/6


12/9/22, 7:59 PM SVM (2)

In [13]:
X=iris.iloc[:,0:4]
y=iris.iloc[:,4]

In [14]:
sc=preprocessing.StandardScaler()
sc=sc.fit(X)
X=sc.transform(X)

In [15]:
x_train,x_test,y_train,y_test=train_test_split(X,y,train_size=0.70,random_state=101)
print(x_test.shape)
print(x_train.shape)
print(y_test.shape)
print(y_train.shape)

(45, 4)
(105, 4)
(45,)
(105,)

In [16]:
from sklearn.svm import SVC
svc_model=SVC(C=.1,kernel='linear',gamma=1)
svc_model.fit(x_train,y_train)

Out[16]: SVC(C=0.1, gamma=1, kernel='linear')

In [17]:
y_pred=svc_model.predict(x_test)

In [18]:
y_pred[:10]

Out[18]: array([0, 0, 0, 2, 1, 2, 1, 1, 2, 0])

In [19]:
print(y_test[:11])

33 0
16 0
43 0
129 2
50 1
123 2
68 1
53 1
146 2
1 0
147 2
Name: species, dtype: int32

In [21]:
import sklearn.metrics as metrics
confusion=metrics.confusion_matrix(y_true=y_test,y_pred=y_pred)
confusion

Out[21]: array([[13, 0, 0],


[ 0, 19, 1],
[ 0, 0, 12]], dtype=int64)

In [22]:
print(svc_model.score(x_train,y_train))
print(svc_model.score(x_test,y_test))

localhost:8888/nbconvert/html/Desktop/m l record/SVM (2).ipynb?download=false 4/6


12/9/22, 7:59 PM SVM (2)

0.9714285714285714
0.9777777777777777

In [23]:
class_wise=metrics.classification_report(y_true=y_test,y_pred=y_pred)
print(class_wise)

precision recall f1-score support

0 1.00 1.00 1.00 13


1 1.00 0.95 0.97 20
2 0.92 1.00 0.96 12

accuracy 0.98 45
macro avg 0.97 0.98 0.98 45
weighted avg 0.98 0.98 0.98 45

In [25]:
from sklearn.svm import SVC
svc_model=SVC(kernel='rbf')
svc_model.fit(x_train,y_train)

Out[25]: SVC()

In [27]:
import sklearn.metrics as metrics
confusion=metrics.confusion_matrix(y_true=y_test,y_pred=y_pred)
confusion

Out[27]: array([[13, 0, 0],


[ 0, 19, 1],
[ 0, 0, 12]], dtype=int64)

In [28]:
print(svc_model.score(x_train,y_train))
print(svc_model.score(x_test,y_test))

0.9809523809523809
0.9777777777777777

In [29]:
class_wise=metrics.classification_report(y_true=y_test,y_pred=y_pred)
print(class_wise)

precision recall f1-score support

0 1.00 1.00 1.00 13


1 1.00 0.95 0.97 20
2 0.92 1.00 0.96 12

accuracy 0.98 45
macro avg 0.97 0.98 0.98 45
weighted avg 0.98 0.98 0.98 45

In [30]:
from sklearn.svm import SVC
svc_model=SVC(kernel='poly')
svc_model.fit(x_train,y_train)

Out[30]: SVC(kernel='poly')

In [31]:
import sklearn.metrics as metrics
confusion=metrics.confusion_matrix(y_true=y_test,y_pred=y_pred)

localhost:8888/nbconvert/html/Desktop/m l record/SVM (2).ipynb?download=false 5/6


12/9/22, 7:59 PM SVM (2)

confusion

Out[31]: array([[13, 0, 0],


[ 0, 19, 1],
[ 0, 0, 12]], dtype=int64)

In [32]:
print(svc_model.score(x_train,y_train))
print(svc_model.score(x_test,y_test))

0.9333333333333333
0.9111111111111111

In [33]:
class_wise=metrics.classification_report(y_true=y_test,y_pred=y_pred)
print(class_wise)

precision recall f1-score support

0 1.00 1.00 1.00 13


1 1.00 0.95 0.97 20
2 0.92 1.00 0.96 12

accuracy 0.98 45
macro avg 0.97 0.98 0.98 45
weighted avg 0.98 0.98 0.98 45

localhost:8888/nbconvert/html/Desktop/m l record/SVM (2).ipynb?download=false 6/6


12/9/22, 8:01 PM Naive bayes

EXPERIMENT-11
Aim:Naive bayes
Naïve Bayes Classifier Algorithm
Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes theorem and used for solving
classification problems. It is mainly used in text classification that includes a high-dimensional training dataset.
Naïve Bayes Classifier is one of the simple and most effective Classification algorithms which helps in building
the fast machine learning models that can make quick predictions. It is a probabilistic classifier, which means it
predicts on the basis of the probability of an object. Some popular examples of Naïve Bayes Algorithm are spam
filtration, Sentimental analysis, and classifying articles.

Steps
1.Understand the business problem 2.Import the library files 3.Load the dataset 4.Data preprocessing 5.Split the
data into train and test 6.Build the model (Naïve Bayes classifier) 7.Test the model 8.Performance Measures
9.Predict the class label for new data.

1.Understand the business problem


Let’s build a classifier that predicts whether I should play tennis given the forecast. It takes four attributes to
describe the forecast; namely, the outlook, the temperature, the humidity, and the presence or absence of wind.
Furthermore, the values of the four attributes are qualitative (also known as categorical). p(C_k |x_1,x_2,…,x_n )=
p(C_k ) ∏_(i=1)^n p(x_i |C_k )

2.Importing libraries
In [9]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import linear_model
from sklearn.model_selection import train_test_split
import gc
import os
os.getcwd()
from sklearn.preprocessing import LabelBinarizer

3.Load the dataset


In [10]:
weather = pd.read_csv("C:\\Users\\staff\\Downloads\\weather.csv")

In [11]:
print (weather.shape)
print (weather.head())

(14, 5)
Outlook Temperature Humidity Wind Class
0 Sunny Hot High Weak No
1 Sunny Hot High Strong No
2 Overcast Hot High Weak Yes
3 Rain Mild High Weak Yes
4 Rain Cool Normal Weak Yes

localhost:8888/nbconvert/html/Desktop/m l record/Naive bayes.ipynb?download=false 1/6


12/9/22, 8:01 PM Naive bayes

In [12]:
weather

Out[12]: Outlook Temperature Humidity Wind Class

0 Sunny Hot High Weak No

1 Sunny Hot High Strong No

2 Overcast Hot High Weak Yes

3 Rain Mild High Weak Yes

4 Rain Cool Normal Weak Yes

5 Rain Cool Normal Strong No

6 Overcast Cool Normal Strong Yes

7 Sunny Mild High Weak No

8 Sunny Cool Normal Weak Yes

9 Rain Mild Normal Weak Yes

10 Sunny Mild Normal Strong Yes

11 Overcast Mild High Strong Yes

12 Overcast Hot Normal Weak Yes

13 Rain Mild High Strong No

4.Data preprocessing
In [13]:
weather.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14 entries, 0 to 13
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Outlook 14 non-null object
1 Temperature 14 non-null object
2 Humidity 14 non-null object
3 Wind 14 non-null object
4 Class 14 non-null object
dtypes: object(5)
memory usage: 688.0+ bytes

In [14]:
weather.describe()

Out[14]: Outlook Temperature Humidity Wind Class

count 14 14 14 14 14

unique 3 3 2 2 2

top Sunny Mild Normal Weak Yes

freq 5 6 7 8 9

In [20]:
X1=weather.iloc[:,0:4]

localhost:8888/nbconvert/html/Desktop/m l record/Naive bayes.ipynb?download=false 2/6


12/9/22, 8:01 PM Naive bayes

In [21]:
X1

Out[21]: Outlook Temperature Humidity Wind

0 Sunny Hot High Weak

1 Sunny Hot High Strong

2 Overcast Hot High Weak

3 Rain Mild High Weak

4 Rain Cool Normal Weak

5 Rain Cool Normal Strong

6 Overcast Cool Normal Strong

7 Sunny Mild High Weak

8 Sunny Cool Normal Weak

9 Rain Mild Normal Weak

10 Sunny Mild Normal Strong

11 Overcast Mild High Strong

12 Overcast Hot Normal Weak

13 Rain Mild High Strong

In [22]:
y=weather.iloc[:,4]

In [23]:
y

Out[23]: 0 No
1 No
2 Yes
3 Yes
4 Yes
5 No
6 Yes
7 No
8 Yes
9 Yes
10 Yes
11 Yes
12 Yes
13 No
Name: Class , dtype: object

In [25]:
from sklearn.preprocessing import LabelEncoder
Numerics=LabelEncoder()
X1['outlook_n']=Numerics.fit_transform (X1['Outlook'])
X1['Temp_n']=Numerics.fit_transform (X1['Temperature'])
X1[ 'Humidity_n']=Numerics.fit_transform(X1[ 'Humidity'])
X1['windy_n']=Numerics.fit_transform(X1[ 'Wind'])
X1

localhost:8888/nbconvert/html/Desktop/m l record/Naive bayes.ipynb?download=false 3/6


12/9/22, 8:01 PM Naive bayes

Out[25]: Outlook Temperature Humidity Wind outlook_n Temp_n Humidity_n windy_n

0 Sunny Hot High Weak 2 1 0 1

1 Sunny Hot High Strong 2 1 0 0

2 Overcast Hot High Weak 0 1 0 1

3 Rain Mild High Weak 1 2 0 1

4 Rain Cool Normal Weak 1 0 1 1

5 Rain Cool Normal Strong 1 0 1 0

6 Overcast Cool Normal Strong 0 0 1 0

7 Sunny Mild High Weak 2 2 0 1

8 Sunny Cool Normal Weak 2 0 1 1

9 Rain Mild Normal Weak 1 2 1 1

10 Sunny Mild Normal Strong 2 2 1 0

11 Overcast Mild High Strong 0 2 0 0

12 Overcast Hot Normal Weak 0 1 1 1

13 Rain Mild High Strong 1 2 0 0

In [26]:
X1=X1.drop(['Outlook', 'Temperature', 'Humidity', 'Wind'], axis='columns')
X1

Out[26]: outlook_n Temp_n Humidity_n windy_n

0 2 1 0 1

1 2 1 0 0

2 0 1 0 1

3 1 2 0 1

4 1 0 1 1

5 1 0 1 0

6 0 0 1 0

7 2 2 0 1

8 2 0 1 1

9 1 2 1 1

10 2 2 1 0

11 0 2 0 0

12 0 1 1 1

13 1 2 0 0

5.split the data into train and test


In [27]:
x_train, x_test, y_train, y_test = train_test_split(X1, y, train_size=0.70, random_s
localhost:8888/nbconvert/html/Desktop/m l record/Naive bayes.ipynb?download=false 4/6
12/9/22, 8:01 PM Naive bayes

In [28]:
x_train

Out[28]: outlook_n Temp_n Humidity_n windy_n

5 1 0 1 0

0 2 1 0 1

4 1 0 1 1

8 2 0 1 1

9 1 2 1 1

7 2 2 0 1

6 0 0 1 0

1 2 1 0 0

11 0 2 0 0

In [29]:
x_test

Out[29]: outlook_n Temp_n Humidity_n windy_n

12 0 1 1 1

2 0 1 0 1

3 1 2 0 1

13 1 2 0 0

10 2 2 1 0

In [30]:
print(x_test.shape)
print(x_train.shape)
print(y_test.shape)
print(y_train.shape)

(5, 4)
(9, 4)
(5,)
(9,)

6.build the model(Navie Bayes classifier)


In [31]:
from sklearn. naive_bayes import CategoricalNB
cnb =CategoricalNB()
y_pred = cnb.fit(x_train, y_train)
predictions=cnb.predict(x_test)

7. Test the mode


In [32]:
predictions[:10]

localhost:8888/nbconvert/html/Desktop/m l record/Naive bayes.ipynb?download=false 5/6


12/9/22, 8:01 PM Naive bayes

Out[32]: array(['Yes', 'No', 'Yes', 'No', 'Yes'], dtype='<U3')

In [33]:
x_test

Out[33]: outlook_n Temp_n Humidity_n windy_n

12 0 1 1 1

2 0 1 0 1

3 1 2 0 1

13 1 2 0 0

10 2 2 1 0

8 .Performance Measures
In [34]:
import sklearn.metrics as metrics

In [36]:
confusion= metrics.confusion_matrix (y_true =y_test, y_pred =predictions)
confusion

Out[36]: array([[1, 0],


[1, 3]], dtype=int64)

In [38]:
metrics.accuracy_score (y_true=y_test, y_pred=predictions)

Out[38]: 0.8

In [40]:
class_wise =metrics.classification_report (y_true=y_test, y_pred=predictions)
print(class_wise)

precision recall f1-score support

No 0.50 1.00 0.67 1


Yes 1.00 0.75 0.86 4

accuracy 0.80 5
macro avg 0.75 0.88 0.76 5
weighted avg 0.90 0.80 0.82 5

9. Predict the class label for new data.


In [41]:
predicted= cnb.predict([[0,2,0,0]])
print (predicted)

['Yes']

localhost:8888/nbconvert/html/Desktop/m l record/Naive bayes.ipynb?download=false 6/6


EXPERIMENT-12
12/18/22, 4:10 PM Untitled3

RANDOM FOREST
1. Importing the library files 2. Reading the Iris Dataset 3. Preprocessing 4. Split the dataset into training and
testing 5. Build the model (Random Forest Model) 6. Evaluate the performance of the Model
In [3]:
import numpy as пр
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn import tree
from sklearn.model_selection import train_test_split

In [4]:
iris = pd.read_csv("iris.csv")

In [5]:
print (iris. shape)
print(iris.head())

(150, 5)
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa

In [6]:
iris

Out[6]: sepal_length sepal_width petal_length petal_width species

0 5.1 3.5 1.4 0.2 setosa

1 4.9 3.0 1.4 0.2 setosa

2 4.7 3.2 1.3 0.2 setosa

3 4.6 3.1 1.5 0.2 setosa

4 5.0 3.6 1.4 0.2 setosa

... ... ... ... ... ...

145 6.7 3.0 5.2 2.3 virginica

146 6.3 2.5 5.0 1.9 virginica

147 6.5 3.0 5.2 2.0 virginica

148 6.2 3.4 5.4 2.3 virginica

149 5.9 3.0 5.1 1.8 virginica

150 rows × 5 columns

In [7]:
print(iris.species)

0 setosa
1 setosa

localhost:8888/nbconvert/html/Desktop/m l record/Untitled3.ipynb?download=false 1/8


12/18/22, 4:10 PM Untitled3

2 setosa
3 setosa
4 setosa
...
145 virginica
146 virginica
147 virginica
148 virginica
149 virginica
Name: species, Length: 150, dtype: object

In [8]:
iris.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 sepal_length 150 non-null float64
1 sepal_width 150 non-null float64
2 petal_length 150 non-null float64
3 petal_width 150 non-null float64
4 species 150 non-null object
dtypes: float64(4), object(1)
memory usage: 6.0+ KB

In [13]:
g=sns.pairplot(iris,hue='species')

localhost:8888/nbconvert/html/Desktop/m l record/Untitled3.ipynb?download=false 2/8


12/18/22, 4:10 PM Untitled3

In [14]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(iris.species)
iris['species']=le.transform(iris.species)
print(iris)
print(iris.species)

sepal_length sepal_width petal_length petal_width species


0 5.1 3.5 1.4 0.2 0
1 4.9 3.0 1.4 0.2 0
2 4.7 3.2 1.3 0.2 0
3 4.6 3.1 1.5 0.2 0
4 5.0 3.6 1.4 0.2 0
.. ... ... ... ... ...
145 6.7 3.0 5.2 2.3 2
146 6.3 2.5 5.0 1.9 2
147 6.5 3.0 5.2 2.0 2
148 6.2 3.4 5.4 2.3 2
149 5.9 3.0 5.1 1.8 2

[150 rows x 5 columns]


0 0
1 0
2 0
3 0
4 0
..
145 2
146 2
147 2
148 2
149 2
Name: species, Length: 150, dtype: int32

In [21]:
X=iris.iloc[:,0:4]

In [22]:
X

Out[22]: sepal_length sepal_width petal_length petal_width

0 5.1 3.5 1.4 0.2

1 4.9 3.0 1.4 0.2

2 4.7 3.2 1.3 0.2

3 4.6 3.1 1.5 0.2

4 5.0 3.6 1.4 0.2

... ... ... ... ...

145 6.7 3.0 5.2 2.3

146 6.3 2.5 5.0 1.9

147 6.5 3.0 5.2 2.0

148 6.2 3.4 5.4 2.3

149 5.9 3.0 5.1 1.8

150 rows × 4 columns

localhost:8888/nbconvert/html/Desktop/m l record/Untitled3.ipynb?download=false 3/8


12/18/22, 4:10 PM Untitled3

In [23]:
y=iris.iloc[:,4:]

In [24]:
y

Out[24]: species

0 0

1 0

2 0

3 0

4 0

... ...

145 2

146 2

147 2

148 2

149 2

150 rows × 1 columns

In [28]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.70, random_st

In [29]:
print (X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(105, 4)
(105, 1)
(45, 4)
(45, 1)

5. Build the model (Random Forest Model)


sklearn.ensemble.RandomForestClassifier

class sklearn.ensemble.RandomForestClassifier(n_estimators=100, *, criterion='gin i’,


max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf =0.0,
max_features='sqrt', max_leaf_nodes=None, min_impurity_decrease=0.0, bootstrap =True,
oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=Fal se,
class_weight=None, ccp_alpha=0.0, max_samples=None)

A random forest classifier.

A random forest is a meta estimator that fits a number of decision tree classifiers on various
sub-samples of the dataset and uses averaging to improve the predictive accuracy and control

localhost:8888/nbconvert/html/Desktop/m l record/Untitled3.ipynb?download=false 4/8


12/18/22, 4:10 PM Untitled3

over-fitting. The sub-sample size is controlled with the max_samples parameter if


bootstrap=True (default), otherwise the whole dataset is used to build each tree.

Parameters:

n_estimatorsint, default=10@ The number of trees in the forest.

criterion{“gini”, “entropy”, “Log_Loss”}, default="gini” The function to measure the quality of a


split. Supported criteria are “gini” for the Gini impurity and “log_loss” and “entropy” both for the
Shannon information gain.

max_depthint, default=None The maximum depth of the tree. If None, then nodes are expanded
until all leaves are pure or until all leaves contain less than min_samples_split samples.

min_samples_splitint or float, default=2 The minimum number of samples required to split an


internal node:

° If int, then consider min_samples_split as the minimum number.

° If float, then min_samples split is a fraction and ceil(min_samples split * n_samples) are the
minimum number of samples for each split. min_samples_leafint or float, default=1 The
minimum number of samples required to be at a leaf node. A split point at any depth will only
be considered if it leaves at least min_samples_leaf training samples in each of the left and right
branches. This may have the effect of smoothing the model, especially in regression.

¢ If int, then consider min_samples leaf as the minimum number.

¢ If float, then min_samples leaf is a fraction and ceil(min_samples leaf * n_samples) are the
minimum num- ber of samples for each node.

min_weight_fraction_leaffloat, default=0.0 The minimum weighted fraction of the sum total of


weights (of all the input samples) required to be at a leaf node. Samples have equal weight
when sample_weight is not provided.

max_features{“sqrt”, “Log2”, None}, int or float, default="s qrt” The number of features to
consider when looking for the best split:

e¢ If int, then consider max_features features at each split.

ons float, then max_features is a fraction and max(1, int(max_features * n_featuresin)) features
are considered at each split. min_samples_leafint or float, default=1 The minimum number of
samples required to be at a leaf node. A split point at any depth will only be considered if it
leaves at least min_samples_leaf training samples in each of the left and right branches. This may
have the effect of smoothing the model, especially in regression.

¢ If int, then consider min_samples leaf as the minimum number.

¢ If float, then min_samples leaf is a fraction and ceil(min_samples leaf * n_samples) are the
minimum num- ber of samples for each node.

min_weight_fraction_leaffloat, default=0.0 The minimum weighted fraction of the sum total of


weights (of all the input samples) required to be at a leaf node. Samples have equal weight

localhost:8888/nbconvert/html/Desktop/m l record/Untitled3.ipynb?download=false 5/8


12/18/22, 4:10 PM Untitled3

when sample_weight is not provided.

max_features{“sqrt”, “Log2”, None}, int or float, default="s qrt” The number of features to
consider when looking for the best split:

e¢ If int, then consider max_features features at each split.

ons float, then max_features is a fraction and max(1, int(max_features * n_featuresin)) features
are considered at each split.

In [30]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=3,max_depth=2, random_state=42)

In [33]:
# Fit RandomForestCLassifier
rfc.fit(X_train, y_train)

# Predict the test set Labels


y_pred = rfc.predict(X_test)

<ipython-input-33-864905ae552e>:2: DataConversionWarning: A column-vector y was pass


ed when a 1d array was expected. Please change the shape of y to (n_samples,), for e
xample using ravel().
rfc.fit(X_train, y_train)

In [37]:
from sklearn import tree
features = X.columns.values
classes = ['setosa', 'versicolor', 'virginica']
for estimator in rfc.estimators_:
print(estimator)
plt.figure(figsize=(12,6))
tree.plot_tree(estimator,
feature_names=features,
class_names=classes,
fontsize=8,
filled=True,
rounded=True)
plt.show()

DecisionTreeClassifier(max_depth=2, max_features='auto',
random_state=1608637542)

localhost:8888/nbconvert/html/Desktop/m l record/Untitled3.ipynb?download=false 6/8


12/18/22, 4:10 PM Untitled3

DecisionTreeClassifier(max_depth=2, max_features='auto',
random_state=1273642419)

DecisionTreeClassifier(max_depth=2, max_features='auto',
random_state=1935803228)

In [39]:
from sklearn.metrics import classification_report, confusion_matrix

cm = confusion_matrix(y_test, y_pred)
print(cm)
sns.heatmap(cm, annot=True, fmt='d').set_title( 'confusion matrix (0 = setosa, 1 = v
print(classification_report(y_test,y_pred))

[[13 0 0]
[ 0 19 1]
[ 0 2 10]]
precision recall f1-score support

0 1.00 1.00 1.00 13


1 0.90 0.95 0.93 20
2 0.91 0.83 0.87 12

accuracy 0.93 45
macro avg 0.94 0.93 0.93 45
weighted avg 0.93 0.93 0.93 45

localhost:8888/nbconvert/html/Desktop/m l record/Untitled3.ipynb?download=false 7/8


12/18/22, 4:10 PM Untitled3

In [40]:
print(rfc.score(X_train, y_train))
print(rfc.score(X_test, y_test))

0.9714285714285714
0.9333333333333333

In [ ]:
# Organizing feature names and importances in a DataFrame
features_df = pd.DataFrame({'features': rfc.feature_names_in_, 'importances': rfc.fe

# Sorting data from highest to Lowest


features_df_sorted = features_df.sort_values(by='importances', ascending=False)

# Barplot of the result without borders and axis Lines


g = sns.barplot(data=features_df_sorted, x='importances', y ='features', palette="ro
sns.despine(bottom = True, left = True)
g.set_title('Feature importances')
g.set(xlabel=None)
g.set(ylabel=None)
g.set(xticks=[])
for value in g.containers:
g.bar_label(value, padding=2)

In [ ]:

localhost:8888/nbconvert/html/Desktop/m l record/Untitled3.ipynb?download=false 8/8

You might also like