9 .ML Programs
9 .ML Programs
AIM: EXPERIMENT-1
Plotting for Exploratory Data Analysis
(EDA)
Description:
Exploratory data analysis was promoted by John Tukey to encourage statisticians to explore
data, and possibly formulate hypotheses that might cause new data collection and experiments.
EDA focuses more narrowly on checking assumptions required for model fitting and hypothesis
testing. It also checks while handling missing values and making transformations of variables as
needed.filling the counts with
EDA build a robust understanding of the data, issues associated with either the info or process.
it’s a scientific approach to get the story of the data.
ALGORITHM:
1.Load.csv files
2.Dataset Information
3.Data Cleaning/Wrangling
4.Group by names
5.Summary of Statistics
10.Removing Columns
11.Univariate Analysis
12.Bivariate Analysis
13.Multi-Variate Analysis
15.Correlation
In [10]:
import pandas as pd
import seaborn as sns
localhost:8888/nbconvert/html/Desktop/m l record/IRIS FLOWER DATASET(EDA).ipynb?download=false 1/21
12/9/22, 7:45 PM IRIS FLOWER DATASET(EDA)
Out[10]: 'C:\\Users\\STAFF'
In [11]:
iris=pd.read_csv("C:\\Users\\STAFF\\Desktop\\New folder\\iris.csv")
In [12]:
print(iris.shape)
print(iris.head())
(150, 5)
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
In [13]:
print(iris.columns)
In [14]:
iris["species"].value_counts()
Out[14]: setosa 50
virginica 50
versicolor 50
Name: species, dtype: int64
In [15]:
iris.plot(kind='scatter',x='sepal_length',y='sepal_width')
plt.show()
Observation
1.
In [16]:
sns.set_style("whitegrid")
sns.FacetGrid(iris,hue="species",height=4).map(plt.scatter,"sepal_length","sepal_wid
plt.show()
Observation
1. From the above graph we can differentiate the 3 species of iris namely setosa, versicolor
and virginica.
2. These 3 types of iris can be differentiated on the basis of their sepal_length and
sepal_width.
3. In the above graph setosa can be entirely differentiated from versicolor and virginica.
4.
In [17]:
plt.close()
sns.set_style("whitegrid")
sns.pairplot(iris,hue="species",height=3)
plt.show()
Obsevation
1. By using this graph we can easily classify setosa from virginica and versicolor.
2. From above graphs it is clear that petal_length and petal_width are 2 important plots.
3. For setosa petal_length is less than 2.
4. For setosa petal_width is 1. 5.
In [18]:
iris=pd.read_csv("C:\\Users\\STAFF\\Desktop\\New folder\\iris.csv")
iris_setosa = iris.loc[iris["species"] == "setosa"];
iris_virginica = iris.loc[iris["species"] == "virginica"];
iris_versicolor = iris.loc[iris["species"] == "versicolor"];
plt.plot(iris_setosa["petal_length"],np.zeros_like(iris_setosa['petal_length']),'o')
plt.plot(iris_virginica["petal_length"],np.zeros_like(iris_virginica['petal_length']
plt.plot(iris_versicolor["petal_length"],np.zeros_like(iris_versicolor['petal_length
In [19]:
sns.FacetGrid(iris,hue="species",height=5)\
.map(sns.histplot,"petal_length")\
.add_legend();
plt.show();
In [20]:
sns.FacetGrid(iris,hue="species",height=5)\
.map(sns.histplot,"sepal_width")\
.add_legend();
plt.show();
In [21]:
counts, bin_edges = np.histogram(iris_setosa['petal_length'], bins=10,
density = True)
pdf = counts/(sum(counts))
print(pdf);
print(bin_edges);
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf);
plt.plot(bin_edges[1:], cdf)
counts, bin_edges = np.histogram(iris_setosa['petal_length'], bins=20,
density = True)
pdf = counts/(sum(counts))
plt.plot(bin_edges[1:],pdf);
plt.show();
In [22]:
counts, bin_edges = np.histogram(iris_setosa['petal_length'], bins=10,
density = True)
pdf = counts/(sum(counts))
print(pdf);
print(bin_edges)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf)
plt.plot(bin_edges[1:], cdf)
plt.show();
[0.02 0.02 0.04 0.14 0.24 0.28 0.14 0.08 0. 0.04]
[1. 1.09 1.18 1.27 1.36 1.45 1.54 1.63 1.72 1.81 1.9 ]
In [23]:
counts, bin_edges = np.histogram(iris_setosa['petal_length'], bins=10,
density = True)
pdf = counts/(sum(counts))
print(pdf);
print(bin_edges)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf)
plt.plot(bin_edges[1:], cdf)
counts, bin_edges = np.histogram(iris_virginica['petal_length'], bins=10,
density = True)
pdf = counts/(sum(counts))
print(pdf);
print(bin_edges)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf)
plt.plot(bin_edges[1:], cdf)
counts, bin_edges = np.histogram(iris_versicolor['petal_length'], bins=10,
density = True)
pdf = counts/(sum(counts))
print(pdf);
print(bin_edges)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:],pdf)
plt.plot(bin_edges[1:], cdf)
plt.show();
In [24]:
print("Means:")
print(np.mean(iris_setosa["petal_length"]))
print(np.mean(np.append(iris_setosa["petal_length"],50)));
print(np.mean(iris_virginica["petal_length"]))
print(np.mean(iris_versicolor["petal_length"]))
print("\nStd-dev:");
print(np.std(iris_setosa["petal_length"]))
print(np.std(iris_virginica["petal_length"]))
print(np.std(iris_versicolor["petal_length"]))
Means:
1.464
2.4156862745098038
5.552
4.26
Std-dev:
0.17176728442867115
0.5463478745268441
0.4651881339845204
In [25]:
print("\nMedians:")
print(np.median(iris_setosa["petal_length"]))
print(np.median(np.append(iris_setosa["petal_length"],50)));
print(np.median(iris_virginica["petal_length"]))
print(np.median(iris_versicolor["petal_length"]))
print("\nQuantiles:")
print(np.percentile(iris_setosa["petal_length"],np.arange(0, 100, 25)))
print(np.percentile(iris_virginica["petal_length"],np.arange(0, 100, 25)))
print(np.percentile(iris_versicolor["petal_length"], np.arange(0, 100, 25)))
print("\n90th Percentiles:")
print(np.percentile(iris_setosa["petal_length"],90))
print(np.percentile(iris_virginica["petal_length"],90))
print(np.percentile(iris_versicolor["petal_length"], 90))
from statsmodels import robust
print ("\nMedian Absolute Deviation")
print(robust.mad(iris_setosa["petal_length"]))
print(robust.mad(iris_virginica["petal_length"]))
print(robust.mad(iris_versicolor["petal_length"]))
Medians:
1.5
1.5
5.55
4.35
Quantiles:
[1. 1.4 1.5 1.575]
localhost:8888/nbconvert/html/Desktop/m l record/IRIS FLOWER DATASET(EDA).ipynb?download=false 8/21
12/9/22, 7:45 PM IRIS FLOWER DATASET(EDA)
90th Percentiles:
1.7
6.31
4.8
In [26]:
sns.boxplot(x='species',y='petal_length', data=iris)
plt.show()
In [57]:
sns.violinplot(x="species",y="sepal_width",data=iris,size=8)
plt.show()
In [58]:
sns.violinplot(x="species",y="petal_length",data=iris,size=8)
plt.show()
In [59]:
sns.violinplot(x="species",y="sepal_length",data=iris,size=8)
plt.show()
In [51]:
sns.violinplot(x="species",y="petal_width",data=iris,size=8)
plt.show()
In [52]:
sns.jointplot(x="petal_length",y="petal_width",data=iris_setosa,kind="kde")
plt.show()
In [55]:
sns.jointplot(x="sepal_length",y="sepal_width",data=iris_versicolor,kind="kde")
plt.show()
In [56]:
sns.jointplot(x="petal_length",y="sepal_width",data=iris_virginica,kind="kde")
plt.show()
In [45]:
sns.jointplot(x="sepal_length",y="petal_width",data=iris_setosa,kind="kde")
plt.show()
In [31]:
iris_virginica_SW=iris_virginica.iloc[:,1]
iris_versicolor_SW=iris_versicolor.iloc[:,1]
In [32]:
from scipy import stats
stats.ks_2samp(iris_virginica_SW,iris_versicolor_SW)
In [33]:
x=stats.norm.rvs(loc=0.2,size=10)
stats.kstest(x,'norm')
In [34]:
x=stats.norm.rvs(loc=0.2,size=100)
stats.kstest(x,'norm')
In [35]:
x=stats.norm.rvs(loc=0.2,size=1000)
stats.kstest(x,'norm')
In [21]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
In [22]:
dataset=pd.read_csv("C:\\Users\\STAFF\\Desktop\\20981a4409\\Data1.csv")
dataset
In [23]:
print(dataset)
dataset.isnull()
In [24]:
x=dataset.iloc[:, :-1].values
x
In [25]:
print(x)
In [26]:
y=dataset.iloc[:,3].values
In [27]:
print(y)
In [28]:
print(x)
In [29]:
dataset
In [30]:
dataset[dataset.notnull()]
In [31]:
dataset.isnull().sum()
Out[31]: Name 0
Gender 1
Age 2
Salary 3
dtype: int64
In [32]:
dataset.shape
Out[32]: (5, 4)
In [33]:
df1=dataset.dropna()
df1
In [34]:
df1.shape
Out[34]: (1, 4)
In [35]:
df2=dataset.dropna(axis='columns')
df2
Out[35]: Name
0 James
1 Kumar
2 Laxman
3 Rani
4 Prem
In [36]:
dataset
In [37]:
df4=dataset.dropna(axis='columns',how='all')
df4
In [38]:
print(dataset)
df5=dataset.dropna(axis='rows',thresh=3)
df5
In [39]:
dataset
In [40]:
df6=dataset.fillna(0)
df6
In [41]:
df7=dataset.fillna(1)
df7
In [42]:
print(dataset)
df8=dataset.fillna(method="ffill")
df8
In [43]:
print(dataset)
df9=dataset.fillna(method="bfill")
df9
In [45]:
print(dataset)
df11=dataset
df11["Gender"]=dataset["Gender"].fillna("male")
df11
In [46]:
dataset.Gender.value_counts()
Out[46]: male 4
female 1
Name: Gender, dtype: int64
In [47]:
df11.Gender.value_counts()
Out[47]: male 4
female 1
Name: Gender, dtype: int64
In [48]:
dataset
In [49]:
dataset1=dataset.fillna(dataset.mean())
dataset1
In [50]:
dataset2=dataset.fillna(dataset.mode())
dataset2
In [51]:
dataset3=dataset.fillna(dataset.median())
dataset3
In [55]:
y=dataset.iloc[:,:].values
print(y)
from sklearn.impute import SimpleImputer
imputer=SimpleImputer(missing_values=np.nan,strategy='most_frequent')
imputer=imputer.fit(y)
y=imputer.transform(y)
y
In [56]:
z=dataset.iloc[:,2:4].values
print(z)
from sklearn.impute import SimpleImputer
imputer=SimpleImputer(missing_values=np.nan,strategy='mean')
imputer=imputer.fit(z)
z=imputer.transform(z)
z
[[3.0e+01 2.0e+04]
[5.0e+01 nan]
[ nan nan]
[2.5e+01 nan]
[ nan 5.0e+04]]
Out[56]: array([[3.0e+01, 2.0e+04],
[5.0e+01, 3.5e+04],
[3.5e+01, 3.5e+04],
[2.5e+01, 3.5e+04],
[3.5e+01, 5.0e+04]])
In [57]:
x=dataset.iloc[:, :].values
x
In [58]:
from sklearn.preprocessing import LabelEncoder
label_encoder_x=LabelEncoder()
x[:, 1]=label_encoder_x.fit_transform(x[:, 1])
x
In [59]:
x=dataset.iloc[:, :].values
print(x)
In [61]:
from sklearn.impute import SimpleImputer
imputer=SimpleImputer(missing_values=np.nan,strategy='mean')
imputer.fit(x[:, 2:])
x[:, 2:]=imputer.transform(x[:, 2:])
print(x)
In [62]:
x
localhost:8888/nbconvert/html/Desktop/m l record/IRIS FLOWER DATASET(EDA).ipynb?download=false 20/21
12/9/22, 7:45 PM IRIS FLOWER DATASET(EDA)
In [63]:
from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
x[:, 2:]=sc.fit_transform(x[:, 2:])
In [64]:
x
In [65]:
from sklearn.preprocessing import LabelEncoder
label_encoder_x=LabelEncoder()
x[:, 1]=label_encoder_x.fit_transform(x[:, 1])
x
EXPERIMENT-2
2.STATISTICAL OPERATIONS
STATISTICAL OPERATIONS USING PYTHON
Aim:-
by using python write a numpy program to compute the 1)Expected Value 2)Mean 3)Standard
Deviation 4)Variance 5)Covariance 6)Covariance Matrix Of Two Given Arrays
Description:-
Statistics is a pillar of machine learning. You cannot develop a deep understanding and
application of machine learning without it.
1)Expected Value:-An expected value gives a quick insight into the behavior of a random
variable without knowing if it is discrete or continuous.
4)Variance:-It is the expected value of the squared variation of a random variable from its mean
value, in probability and statistics. Informally, variance estimates how far a set of numbers
(random) are spread out from their mean value.
5)Covariance:- It is a measure of the relationship between two random variables and to what
extent, they change together.
6)Covariance Matrix Of Two Given Arrays:-In NumPy for computing the covariance matrix of two
given arrays with help of numpy.cov(). In this, we will pass the two arrays and it will return the
covariance matrix of two given arrays.
Algorithm:-
Step1:-Importing the library files.
Step6:-Model Evaluation.
In [6]:
import statistics
list = [1, 2, 3,3,4,5,]
In [7]:
import numpy as np
def expected_value(values, weights):
values = np.asarray(values)
weights = np.asarray(weights)
return (values * weights).sum() / weights.sum()
values = [0, 1, 2, 3, 4]
probs = [.18, .34, .35, .11, .02]
expected_value(values, probs)
Out[7]: 1.4500000000000002
In [8]:
import numpy as np
a = np.array([1, 2, 3, 4])
b=np.mean(a)
print("mean: ",b)
c=np.std(a)
print("std: ",c)
d=np.var(a)
print("variance: ",d)
e=np.cov(a)
print("co-variance: ",e)
mean: 2.5
std: 1.118033988749895
variance: 1.25
co-variance: 1.6666666666666665
In [9]:
array1 = np.array([3, 2])
array2 = np.array([7, 4])
# Covariance matrix
print("\nCovariance matrix :\n",
np.cov(array1, array2))
Covariance matrix :
[[0.5 1.5]
[1.5 4.5]]
AIM : DATAEXPERIMENT-3
PREPROCESSING - continous /
discrete
DATA data
PREPROCESSING-CONTINUOUS /DISCRETE DATA
ALGORITHM :
-> Getting the data set. -> Importing libraries. -> Importing datasets. -> Finding Missing Data. -> Finding
Outliers -> Splitting dataset into training and testing -> Feature scaling.
DESCRIPTION :
Data preprocessing is a process of preparing the raw data and making it suitable for a machine
learning model. It is the first and crucial step while creating a machine learning model.
When creating a machine learning project, it is not always a case that we come across the clean
and formatted data. And while doing any operation with data, it is mandatory to clean it and put
in a formatted way. So for this, we use data preprocessing task.
PROGRAM:
1. IMPORT THE LIBRARY FILES
In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
In [2]:
dataset = pd.read_csv('C:\\Users\\STAFF\\Downloads\\Data1.csv')
dataset
In [3]:
print(dataset)
In [4]:
print(dataset)
dataset.isnull()
In [5]:
dataset
In [6]:
dataset[dataset.notnull()]
In [7]:
dataset.isnull().sum()
Out[7]: Name 0
Gender 1
Age 2
Salary 3
dtype: int64
In [8]:
dataset.shape
localhost:8888/nbconvert/html/Desktop/m l record/DATA PREPROCESSING.ipynb?download=false 2/9
12/9/22, 7:49 PM DATA PREPROCESSING
Out[8]: (5, 4)
In [9]:
df1=dataset.dropna()
df1
In [10]:
df1.shape
Out[10]: (1, 4)
In [11]:
df2=dataset.dropna(axis='columns')
df2
Out[11]: Name
0 James
1 Kumar
2 Laxman
3 Rani
4 Prem
In [12]:
df3=dataset.dropna(axis=1)
df3
Out[12]: Name
0 James
1 Kumar
2 Laxman
3 Rani
4 Prem
In [13]:
dataset
In [14]:
df4=dataset.dropna(axis='columns',how='all')
df4
In [15]:
print(dataset)
df5=dataset.dropna(axis='rows',thresh=3)
df5
In [16]:
dataset
In [17]:
df6=dataset.fillna(0)
df6
In [18]:
df7=dataset.fillna(1)
df7
In [19]:
print(dataset)
df8=dataset.fillna(method="ffill")
df8
In [20]:
print(dataset)
df9=dataset.fillna(method="bfill")
df9
In [21]:
print(dataset)
df10=dataset.fillna(method="ffill")
df10
In [22]:
print(dataset)
df11=dataset
df11["Gender"]=dataset["Gender"].fillna("male")
df11
In [23]:
dataset.Gender.value_counts()
Out[23]: male 4
female 1
Name: Gender, dtype: int64
In [24]:
df11.Gender.value_counts()
Out[24]: male 4
female 1
Name: Gender, dtype: int64
In [25]:
dataset
In [26]:
dataset1=dataset.fillna(dataset.mean())
dataset1
In [28]:
dataset2=dataset.fillna(dataset.mode())
dataset2
In [29]:
dataset3=dataset.fillna(dataset.median())
In [30]:
dataset3
In [31]:
Y=dataset.iloc[:,:].values
print(Y)
from sklearn.impute import SimpleImputer
imputer=SimpleImputer(missing_values=np.nan,strategy='most_frequent')
imputer=imputer.fit(Y)
Y=imputer.transform(Y)
Y
In [32]:
dataset
In [34]:
Z=dataset.iloc[:,2:4].values
print(Z)
from sklearn.impute import SimpleImputer
imputer=SimpleImputer(missing_values=np.nan,strategy='most_frequent')
imputer=imputer.fit(Z)
Z=imputer.transform(Z)
Z
[[3.0e+01 2.0e+04]
[5.0e+01 nan]
[ nan nan]
[2.5e+01 nan]
[ nan 5.0e+04]]
EXPERIMENT-4
Aim : Data Preprocessing - Categorical Data
DATA PREPROCESSING - CATEGORICAL DATA
Algorithm:
For a given set of training data examples stored in a CSV file. Preprocessing in Machine learning
with the following steps
b. Importing libraries.
g. Feature scaling.
Description:
Data preprocessing is a process of preparing the raw data and making it suitable for a machine
learning model. It is the first and crucial step while creating a machine learning model.
Categorical data is data which has some categories such as, in our dataset; there are two
categorical variable, Country, and Purchased. Since machine learning model completely works
on mathematics and numbers, but if our dataset would have a categorical variable, then it may
create trouble while building the model. So it is necessary to encode these categorical variables
into numbers.
Program:
1) Get the Dataset
Dataset may be of different formats for different purposes, such as, if we want to create a
machine learning model for business purpose, then dataset will be different with the dataset
required for a liver patient. So each dataset is different from another dataset. To use the dataset
in our code, we usually put it into a CSV file. However, sometimes, we may also need to use an
HTML or xlsx file.
2) Importing Libraries
In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
In [3]:
X = dataset.iloc[:, :-1].values
X
In [4]:
print(X)
In [5]:
y= dataset.iloc[:,3].values
In [6]:
print(y)
['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']
In [8]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
In [9]:
print(X)
In [11]:
print(X)
In [12]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
In [13]:
X
In [14]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)
In [15]:
print(y)
[0 1 0 0 1 1 0 1 0 1]
In [17]:
print(X_train)
In [18]:
print(X_test)
In [19]:
print(y_train)
[0 1 0 0 1 1 0 1]
In [20]:
print(y_test)
[0 1]
7) Feature Scaling
localhost:8888/nbconvert/html/Desktop/m l record/program4 (1).ipynb?download=false 4/5
12/9/22, 8:01 PM program4 (1)
In [21]:
print(X)
In [22]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])
X_test[:, 3:] = sc.transform(X_test[:, 3:])
In [23]:
X_train
In [24]:
X_test
EXPERIMENT-5
5.Decision Tree
DECISION TREE
Aim:-
Write a program to demonstrate the working of the decision tree based ID3 algorithm.Use an
appropriate data set for building the decision tree and apply this knowledge to classify an new
sample
Description:
A decision tree is a support tool with a tree-like structure that
models probable outcomes, cost of
resources,utilities, and possible consequences. Decision trees
provide a way to present algorithms with conditional
control statements. They include branches that represent decision-
making steps that can lead to a favorable result.
Decision trees typically consist of three different elements: Root Node: The top-level node
represents the ultimate objective or big decision you’re trying to make. Branches: Branches,
which stem from the root, represent different options—or courses of action—that are available
when making a particular decision. They are most commonly indicated with an arrow line and
often include associated costs, as well as the likelihood to occur. Leaf Node: The leaf nodes—
which are attached at the end of the branches—represent possible outcomes for each action.
There are typically two types of leaf nodes: square leaf nodes, which indicate another decision to
be made, and circle leaf nodes, which indicate a chance event or unknown outcome.
Algorithm:
Steps:
3. Preprocessing
Program:
1.Importing the Library Files
localhost:8888/nbconvert/html/Desktop/m l record/Decision Tree.ipynb?download=false 1/9
12/9/22, 7:52 PM Decision Tree
In [5]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.model_selection import train_test_split
In [8]:
print(iris.shape)
print(iris.head())
(150, 5)
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
In [9]:
print(iris.species)
0 setosa
1 setosa
2 setosa
3 setosa
4 setosa
...
145 virginica
146 virginica
147 virginica
148 virginica
149 virginica
Name: species, Length: 150, dtype: object
In [10]:
iris.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 sepal_length 150 non-null float64
1 sepal_width 150 non-null float64
2 petal_length 150 non-null float64
3 petal_width 150 non-null float64
4 species 150 non-null object
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
3.Preprocessing
In [12]:
from sklearn import preprocessing
le=preprocessing.LabelEncoder()
le.fit(iris.species)
iris['species']=le.transform(iris.species)
print(iris)
print(iris.species)
In [17]:
x=iris.iloc[:,0:4]
x
In [18]:
y=iris.iloc[:,4]
y
Out[18]: 0 0
1 0
2 0
3 0
4 0
..
145 2
146 2
147 2
148 2
149 2
Name: species, Length: 150, dtype: int64
In [20]:
print(x_test.shape)
print(x_train.shape)
print(y_test.shape)
print(y_train.shape)
(45, 4)
(105, 4)
(45,)
(105,)
In [25]:
y_pred=dt_train.predict(x_test)
y_pred[:10]
In [26]:
print(y_test[0:11])
33 0
16 0
43 0
129 2
50 1
123 2
68 1
53 1
146 2
1 0
147 2
Name: species, dtype: int64
Accuracy of Test
In [30]:
print(dt_train.score(x_train,y_train))
print(dt_train.score(x_test,y_test))
1.0
0.9555555555555556
In [31]:
class_wise=metrics.classification_report(y_true=y_test,y_pred=y_pred)
print(class_wise)
accuracy 0.96 45
macro avg 0.96 0.96 0.96 45
weighted avg 0.96 0.96 0.96 45
In [33]:
with open("decision_tree_train.log","w") as fout:
fout.write(text_representation)
In [34]:
fn=['sepal length (cm)','sepal width (cm)','petal length (cm)','petal width (cm)']
cn=['setosa','versicolor','virginica']
In [35]:
fig=plt.figure(figsize=(25,20))
tree.plot_tree(clf_train,feature_names=fn,class_names=cn,filled=True)
fig.savefig('imagename.png')
In [36]:
fig.savefig("decision_tree_train.png")
In [38]:
text_representation=tree.export_text(clf_test)
print(text_representation)
In [39]:
with open("decision_tree_test.log","w") as fout:
fout.write(text_representation)
In [41]:
fig=plt.figure(figsize=(25,20))
tree.plot_tree(clf_test,feature_names=fn,class_names=cn,filled=True)
fig.savefig('imagename1.png')
In [42]:
fig.savefig("decision_tree_test.png")
In [46]:
text_representation=tree.export_text(clf)
print(text_representation)
In [49]:
with open("decision_tree.log","w") as fout:
fout.write(text_representation)
In [48]:
fig=plt.figure(figsize=(25,20))
tree.plot_tree(clf,feature_names=fn,class_names=cn,filled=True)
fig.savefig('imagename2.png')
LINEAR REGRESSION
EXPERIMENT-6
LINEAR REGRESSION
Build a linear regression model using python for a particular data set by a. Splitting Training data
and Test data. b. Evaluate the model(intercept and slope). c. Visualize the training set and testing
set d. Predicting the test set result e. Compare actual output values with predicted values.
Algorithm:
To Build a linear regression model using python for a particular
data set by
a. Splitting Training data and Test data.
b. Evaluate the model(intercept and slope.
c. Visualize the training set and testing set
d. Predicting the test set result.
e. Compare actual output values with predicted values.
Description:
Linear regression analysis is used to predict the value of a variable based on the value of another
variable. The variable you want to predict is called the dependent variable. The variable you are
using to predict the other variable's value is called the independent variable.
This form of analysis estimates the coefficients of the linear equation, involving one or more
independent variables that best predict the value of the dependent variable. Linear regression
fits a straight line or surface that minimizes the discrepancies between predicted and actual
output values. There are simple linear regression calculators that use a “least squares” method to
discover the best-fit line for a set of paired data. You then estimate the value of X (dependent
variable) from Y (independent variable).
In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
data=pd.read_csv("weight-height1.csv")
data.head()
In [4]:
data.isnull().sum()
Out[4]: Gender 0
Height 0
Weight 0
dtype: int64
In [5]:
data.shape
Out[5]: (10000, 3)
In [10]:
x1 = data.iloc[:, 0].values
y1 = data.iloc[:, 2].values
plt.scatter(x1,y1,label='Gender',color='Green',s=50)
plt.xlabel('Gender')
plt.ylabel('Weight')
plt.title('Gender vs Weight')
plt.legend()
plt.show()
In [8]:
x1 = data.iloc[:, 0].values
y1 = data.iloc[:, 1].values
plt.scatter(x1,y1,label='Gender',color='Green',s=50)
plt.xlabel('Gender')
plt.ylabel('Hight')
plt.title('Gender vs Hight')
plt.legend()
plt.show()
In [27]:
x2 = data.iloc[:, 1].values
y2 = data.iloc[:, 2].values
plt.scatter(x2,y2,label='Height',color='blue',s=10)
plt.xlabel('Height')
plt.ylabel('Weight')
plt.title('Height vs Weight')
plt.legend(loc="lower right")
plt.show()
In [22]:
X = data.iloc[:,1:2].values
print(X)
[[73.84701702]
[68.78190405]
[74.11010539]
...
[63.86799221]
[69.03424313]
[61.94424588]]
In [14]:
y = data.iloc[:, 2].values
print(y)
In [16]:
Heightmin=X.min()
Heightmax=X.max()
Heightnorm=(X-Heightmin)/(Heightmax-Heightmin)
Weightmin=y.min()
Weightmax=y.max()
Weightnorm=(y-Weightmin)/(Weightmax-Weightmin)
print(Weightnorm)
print(Heightnorm)
In [23]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_stat
print(X_train)
print(X_test)
print(y_train)
print(y_test)
[[60.1085895 ]
[56.81031728]
[66.20296361]
...
[63.31981767]
[68.99744011]
[66.44514693]]
[[73.18176748]
[71.43337582]
[60.02674978]
...
[65.62500563]
[61.53011039]
[65.89275253]]
[ 98.08851735 84.17069477 164.63446367 ... 119.31598826 200.31301563
163.29338674]
[208.8391625 216.63399969 103.38694607 ... 188.31858763 106.3043173
158.1849857 ]
In [28]:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
print(y_pred)
In [36]:
plt.scatter(X_train, y_train, color = "yellow")
plt.plot(X_train, regressor.predict(X_train), color = 'black')
plt.title('Hight vs Weights (Training set)')
plt.xlabel('Hight')
plt.ylabel('Weight')
plt.show()
In [37]:
plt.scatter(X_test, y_test, color = 'pink')
plt.plot(X_train, regressor.predict(X_train), color = 'blue')
plt.title('Hight vs weights (Test set)')
plt.xlabel('Height')
plt.ylabel('Weight')
In [38]:
y_pred = regressor.predict(X_test)
print('Coefficients: ', regressor.coef_)
print("Mean squared error: %.2f" % np.mean((regressor.predict(X_test) - y_test) ** 2
print('Variance score: %.2f' % regressor.score(X_test, y_test))
Coefficients: [7.72896259]
Mean squared error: 143.23
Variance score: 0.86
In [43]:
knownvalue=int(input("Enter the value of hight:"))
findvalue=regressor.predict([[knownvalue]])
print("when the height value is",knownvalue,"that moment weight value is",findvalue)
In [45]:
data["predicted_value"]=regressor.predict(X)
data.head()
EXPERIMENT-7
MULTIPLE LINEAR REGRESSION
MULTILINEAR REGRESSION
DESCRIPTION
The variable should have a robust relationship with independent variables. However, any
unbiased variables shouldn’t have robust correlations among other independent variables.
Collinearity can be a linear affiliation among explanatory variables. Two variables are perfectly
collinear if there’s a particular linear relationship between them. Multicollinearity refers to a
situation at some stage in which two or greater explanatory variables in the course of a multiple
correlation model are pretty linearly related. We’ve perfect multicollinearity if the correlation
between impartial variables is good to 1 or -1. In practice, we do not often face ideal
multicollinearity for the duration of an information set. More commonly, the difficulty of
multicollinearity arises when there’s an approximately linear courting between two or more
unbiased variables.
Importing Libraries
In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
Out[3]: 0.00632 18.00 2.310 0 0.5380 6.5750 65.20 4.0900 1 296.0 15.30 396.90 4.98 24.00
In [4]:
from sklearn.datasets import load_boston
boston = load_boston()
In [5]:
print(boston.data.shape)
(506, 13)
In [6]:
print(boston.feature_names)
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
'B' 'LSTAT']
In [7]:
print(boston.target)
[24. 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 15. 18.9 21.7 20.4
18.2 19.9 23.1 17.5 20.2 18.2 13.6 19.6 15.2 14.5 15.6 13.9 16.6 14.8
18.4 21. 12.7 14.5 13.2 13.1 13.5 18.9 20. 21. 24.7 30.8 34.9 26.6
25.3 24.7 21.2 19.3 20. 16.6 14.4 19.4 19.7 20.5 25. 23.4 18.9 35.4
24.7 31.6 23.3 19.6 18.7 16. 22.2 25. 33. 23.5 19.4 22. 17.4 20.9
24.2 21.7 22.8 23.4 24.1 21.4 20. 20.8 21.2 20.3 28. 23.9 24.8 22.9
23.9 26.6 22.5 22.2 23.6 28.7 22.6 22. 22.9 25. 20.6 28.4 21.4 38.7
43.8 33.2 27.5 26.5 18.6 19.3 20.1 19.5 19.5 20.4 19.8 19.4 21.7 22.8
18.8 18.7 18.5 18.3 21.2 19.2 20.4 19.3 22. 20.3 20.5 17.3 18.8 21.4
15.7 16.2 18. 14.3 19.2 19.6 23. 18.4 15.6 18.1 17.4 17.1 13.3 17.8
14. 14.4 13.4 15.6 11.8 13.8 15.6 14.6 17.8 15.4 21.5 19.6 15.3 19.4
17. 15.6 13.1 41.3 24.3 23.3 27. 50. 50. 50. 22.7 25. 50. 23.8
23.8 22.3 17.4 19.1 23.1 23.6 22.6 29.4 23.2 24.6 29.9 37.2 39.8 36.2
37.9 32.5 26.4 29.6 50. 32. 29.8 34.9 37. 30.5 36.4 31.1 29.1 50.
33.3 30.3 34.6 34.9 32.9 24.1 42.3 48.5 50. 22.6 24.4 22.5 24.4 20.
21.7 19.3 22.4 28.1 23.7 25. 23.3 28.7 21.5 23. 26.7 21.7 27.5 30.1
44.8 50. 37.6 31.6 46.7 31.5 24.3 31.7 41.7 48.3 29. 24. 25.1 31.5
23.7 23.3 22. 20.1 22.2 23.7 17.6 18.5 24.3 20.5 24.5 26.2 24.4 24.8
29.6 42.8 21.9 20.9 44. 50. 36. 30.1 33.8 43.1 48.8 31. 36.5 22.8
30.7 50. 43.5 20.7 21.1 25.2 24.4 35.2 32.4 32. 33.2 33.1 29.1 35.1
45.4 35.4 46. 50. 32.2 22. 20.1 23.2 22.3 24.8 28.5 37.3 27.9 23.9
21.7 28.6 27.1 20.3 22.5 29. 24.8 22. 26.4 33.1 36.1 28.4 33.4 28.2
22.8 20.3 16.1 22.1 19.4 21.6 23.8 16.2 17.8 19.8 23.1 21. 23.8 23.1
20.4 18.5 25. 24.6 23. 22.2 19.3 22.6 19.8 17.1 19.4 22.2 20.7 21.1
19.5 18.5 20.6 19. 18.7 32.7 16.5 23.9 31.2 17.5 17.2 23.1 24.5 26.6
22.9 24.1 18.6 30.1 18.2 20.6 17.8 21.7 22.7 22.6 25. 19.9 20.8 16.8
21.9 27.5 21.9 23.1 50. 50. 50. 50. 50. 13.8 13.8 15. 13.9 13.3
13.1 10.2 10.4 10.9 11.3 12.3 8.8 7.2 10.5 7.4 10.2 11.5 15.1 23.2
9.7 13.8 12.7 13.1 12.5 8.5 5. 6.3 5.6 7.2 12.1 8.3 8.5 5.
11.9 27.9 17.2 27.5 15. 17.2 17.9 16.3 7. 7.2 7.5 10.4 8.8 8.4
16.7 14.2 20.8 13.4 11.7 8.3 10.2 10.9 11. 9.5 14.5 14.1 16.1 14.3
11.7 13.4 9.6 8.7 8.4 12.8 10.5 17.1 18.4 15.4 10.8 11.8 14.9 12.6
14.1 13. 13.4 15.2 16.1 17.8 14.9 14.1 12.7 13.5 14.9 20. 16.4 17.7
19.5 20.2 21.4 19.9 19. 19.1 19.1 20.1 19.9 19.6 23.2 29.8 13.8 13.3
16.7 12. 14.6 21.4 23. 23.7 25. 21.8 20.6 21.2 19.1 20.6 15.2 7.
8.1 13.6 20.1 21.8 24.5 23.1 19.7 18.3 21.2 17.5 16.8 22.4 20.6 23.9
22. 11.9]
In [8]:
print(boston.DESCR)
.. _boston_dataset:
This dataset was taken from the StatLib library which is maintained at Carnegie Mell
on University.
The Boston house-price data has been used in many machine learning papers that addre
ss regression
problems.
.. topic:: References
In [9]:
import pandas as pd
bos = pd.DataFrame(boston.data)
print(bos.head())
0 1 2 3 4 5 6 7 8 9 10 \
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7
11 12
0 396.90 4.98
1 396.90 9.14
2 392.83 4.03
3 394.63 2.94
4 396.90 5.33
In [10]:
bos['PRICE'] = boston.target
bos.head()
Out[10]: 0 1 2 3 4 5 6 7 8 9 10 11 12 PRICE
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 36.2
In [ ]:
X = bos.drop('PRICE', axis = 1)
X
In [12]:
Y = bos['PRICE']
Y
Out[12]: 0 24.0
1 21.6
2 34.7
3 33.4
4 36.2
...
501 22.4
502 20.6
503 23.9
504 22.0
505 11.9
Name: PRICE, Length: 506, dtype: float64
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)
(354, 13)
(152, 13)
(354,)
(152,)
In [14]:
X_train
Out[14]: 0 1 2 3 4 5 6 7 8 9 10 11 12
445 10.67180 0.0 18.10 0.0 0.740 6.459 94.8 1.9879 24.0 666.0 20.2 43.06 23.98
428 7.36711 0.0 18.10 0.0 0.679 6.193 78.1 1.9356 24.0 666.0 20.2 96.73 21.52
481 5.70818 0.0 18.10 0.0 0.532 6.750 74.9 3.3317 24.0 666.0 20.2 393.07 7.74
55 0.01311 90.0 1.22 0.0 0.403 7.249 21.9 8.6966 5.0 226.0 17.9 395.93 4.81
488 0.15086 0.0 27.74 0.0 0.609 5.454 92.7 1.8209 4.0 711.0 20.1 395.09 18.06
... ... ... ... ... ... ... ... ... ... ... ... ... ...
0 1 2 3 4 5 6 7 8 9 10 11 12
486 5.69175 0.0 18.10 0.0 0.583 6.114 79.8 3.5459 24.0 666.0 20.2 392.68 14.98
189 0.08370 45.0 3.44 0.0 0.437 7.185 38.9 4.5667 5.0 398.0 15.2 396.90 5.39
495 0.17899 0.0 9.69 0.0 0.585 5.670 28.8 2.7986 6.0 391.0 19.2 393.29 17.60
206 0.22969 0.0 10.59 0.0 0.489 6.326 52.5 4.3549 4.0 277.0 18.6 394.87 10.97
355 0.10659 80.0 1.91 0.0 0.413 5.936 19.5 10.5857 4.0 334.0 22.0 376.04 5.57
In [15]:
# code source:https://medium.com/@haydar_ai/learning-data-science-day-9-linear-regre
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train, Y_train)
Y_pred = lm.predict(X_test)
In [16]:
from sklearn.metrics import r2_score
r2=r2_score(Y_test, Y_pred)
print("r2_score value is :",r2)
In [17]:
predicted= lm.predict([[0.10659,80.0,1.91,0.0,0.413,5.936,19.5,10.5857,4.0,334.0,22.
print (predicted)
[17.01702816]
In [18]:
bos1=bos
In [19]:
bos1
Out[19]: 0 1 2 3 4 5 6 7 8 9 10 11 12 PRICE
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 36.2
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
501 0.06263 0.0 11.93 0.0 0.573 6.593 69.1 2.4786 1.0 273.0 21.0 391.99 9.67 22.4
502 0.04527 0.0 11.93 0.0 0.573 6.120 76.7 2.2875 1.0 273.0 21.0 396.90 9.08 20.6
503 0.06076 0.0 11.93 0.0 0.573 6.976 91.0 2.1675 1.0 273.0 21.0 396.90 5.64 23.9
504 0.10959 0.0 11.93 0.0 0.573 6.794 89.3 2.3889 1.0 273.0 21.0 393.45 6.48 22.0
505 0.04741 0.0 11.93 0.0 0.573 6.030 80.8 2.5050 1.0 273.0 21.0 396.90 7.88 11.9
In [20]:
bos1.head()
Out[20]: 0 1 2 3 4 5 6 7 8 9 10 11 12 PRICE
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 36.2
In [21]:
bos1.shape
In [22]:
bos1.dtypes
Out[22]: 0 float64
1 float64
2 float64
3 float64
4 float64
5 float64
6 float64
7 float64
8 float64
9 float64
10 float64
11 float64
12 float64
PRICE float64
dtype: object
In [23]:
num=['float64']
num_vars=list(bos1.select_dtypes(include=num))
In [24]:
num_vars
In [25]:
bos1=bos1[num_vars]
In [26]:
bos1.shape
In [27]:
bos1.isna().sum()
Out[27]: 0 0
1 0
2 0
localhost:8888/nbconvert/html/Desktop/m l record/multicollinearity (1).ipynb?download=false 6/12
12/9/22, 7:56 PM multicollinearity (1)
3 0
4 0
5 0
6 0
7 0
8 0
9 0
10 0
11 0
12 0
PRICE 0
dtype: int64
In [28]:
X = bos1.iloc[:,0:13]
X.shape
y=bos1.iloc[:,-1]
y.shape
Out[28]: (506,)
In [29]:
X.head()
Out[29]: 0 1 2 3 4 5 6 7 8 9 10 11 12
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33
In [30]:
y.head()
Out[30]: 0 24.0
1 21.6
2 34.7
3 33.4
4 36.2
Name: PRICE, dtype: float64
In [31]:
from sklearn.model_selection import train_test_split
In [32]:
X_train, X_test, y_train, y_test=train_test_split(X,y,test_size=0.3,random_state=123
In [33]:
X_train
Out[33]: 0 1 2 3 4 5 6 7 8 9 10 11 12
273 0.22188 20.0 6.96 1.0 0.464 7.691 51.8 4.3665 3.0 223.0 18.6 390.77 6.58
52 0.05360 21.0 5.64 0.0 0.439 6.511 21.1 6.8147 4.0 243.0 16.8 396.90 5.28
181 0.06888 0.0 2.46 0.0 0.488 6.144 62.2 2.5979 3.0 193.0 17.8 396.90 9.45
452 5.09017 0.0 18.10 0.0 0.713 6.297 91.8 2.3682 24.0 666.0 20.2 385.09 17.27
0 1 2 3 4 5 6 7 8 9 10 11 12
381 15.87440 0.0 18.10 0.0 0.671 6.545 99.1 1.5192 24.0 666.0 20.2 396.90 21.08
... ... ... ... ... ... ... ... ... ... ... ... ... ...
98 0.08187 0.0 2.89 0.0 0.445 7.820 36.9 3.4952 2.0 276.0 18.0 393.53 3.57
476 4.87141 0.0 18.10 0.0 0.614 6.484 93.6 2.3053 24.0 666.0 20.2 396.21 18.68
322 0.35114 0.0 7.38 0.0 0.493 6.041 49.9 4.7211 5.0 287.0 19.6 396.90 7.70
382 9.18702 0.0 18.10 0.0 0.700 5.536 100.0 1.5804 24.0 666.0 20.2 396.90 23.60
365 4.55587 0.0 18.10 0.0 0.718 3.561 87.9 1.6132 24.0 666.0 20.2 354.70 7.12
In [34]:
X_test
Out[34]: 0 1 2 3 4 5 6 7 8 9 10 11 12
410 51.13580 0.0 18.10 0.0 0.5970 5.757 100.0 1.4130 24.0 666.0 20.2 2.60 10.11
85 0.05735 0.0 4.49 0.0 0.4490 6.630 56.1 4.4377 3.0 247.0 18.5 392.30 6.53
280 0.03578 20.0 3.33 0.0 0.4429 7.820 64.5 4.6947 5.0 216.0 14.9 387.31 3.76
422 12.04820 0.0 18.10 0.0 0.6140 5.648 87.6 1.9512 24.0 666.0 20.2 291.55 14.10
199 0.03150 95.0 1.47 0.0 0.4030 6.975 15.3 7.6534 3.0 402.0 17.0 396.90 4.56
... ... ... ... ... ... ... ... ... ... ... ... ... ...
310 2.63548 0.0 9.90 0.0 0.5440 4.973 37.8 2.5194 4.0 304.0 18.4 350.45 12.64
91 0.03932 0.0 3.41 0.0 0.4890 6.405 73.9 3.0921 2.0 270.0 17.8 393.55 8.20
151 1.49632 0.0 19.58 0.0 0.8710 5.404 100.0 1.5916 5.0 403.0 14.7 341.60 13.28
426 12.24720 0.0 18.10 0.0 0.5840 5.837 59.7 1.9976 24.0 666.0 20.2 24.65 15.69
472 3.56868 0.0 18.10 0.0 0.5800 6.437 75.0 2.8965 24.0 666.0 20.2 393.37 14.36
In [35]:
corrmatrix = X_train.corr()
In [36]:
corrmatrix
Out[36]: 0 1 2 3 4 5 6 7 8
0 1 2 3 4 5 6 7 8
In [37]:
sns.heatmap(corrmatrix)
Out[37]: <AxesSubplot:>
In [39]:
correlation(X_train, 0.7)
Out[39]: {4, 6, 7, 9}
In [40]:
correlation(X_train, 0.6)
In [41]:
correlation(X_train, 0.9)
Out[41]: set()
In [42]:
correlation(X_train, 0.8)
Out[42]: {9}
In [43]:
correlation(X_train, 0.6)
In [44]:
correlation(X_train, 0.7)
Out[44]: {4, 6, 7, 9}
In [45]:
corr_feature=correlation(X_train, 0.7)
In [46]:
corr_feature
Out[46]: {4, 6, 7, 9}
In [47]:
X_train.shape, X_test.shape
In [48]:
X_test
Out[48]: 0 1 2 3 4 5 6 7 8 9 10 11 12
410 51.13580 0.0 18.10 0.0 0.5970 5.757 100.0 1.4130 24.0 666.0 20.2 2.60 10.11
85 0.05735 0.0 4.49 0.0 0.4490 6.630 56.1 4.4377 3.0 247.0 18.5 392.30 6.53
280 0.03578 20.0 3.33 0.0 0.4429 7.820 64.5 4.6947 5.0 216.0 14.9 387.31 3.76
422 12.04820 0.0 18.10 0.0 0.6140 5.648 87.6 1.9512 24.0 666.0 20.2 291.55 14.10
199 0.03150 95.0 1.47 0.0 0.4030 6.975 15.3 7.6534 3.0 402.0 17.0 396.90 4.56
... ... ... ... ... ... ... ... ... ... ... ... ... ...
310 2.63548 0.0 9.90 0.0 0.5440 4.973 37.8 2.5194 4.0 304.0 18.4 350.45 12.64
91 0.03932 0.0 3.41 0.0 0.4890 6.405 73.9 3.0921 2.0 270.0 17.8 393.55 8.20
151 1.49632 0.0 19.58 0.0 0.8710 5.404 100.0 1.5916 5.0 403.0 14.7 341.60 13.28
426 12.24720 0.0 18.10 0.0 0.5840 5.837 59.7 1.9976 24.0 666.0 20.2 24.65 15.69
472 3.56868 0.0 18.10 0.0 0.5800 6.437 75.0 2.8965 24.0 666.0 20.2 393.37 14.36
In [49]:
X_train.drop(labels=corr_feature,axis=1, inplace=True)
X_test.drop(labels=corr_feature,axis=1, inplace=True)
C:\Users\STAFF\anaconda3\lib\site-packages\pandas\core\frame.py:4308: SettingWithCop
yWarning:
A value is trying to be set on a copy of a slice from a DataFrame
In [50]:
X_train.shape, X_test.shape
In [51]:
X_train
Out[51]: 0 1 2 3 5 8 10 11 12
273 0.22188 20.0 6.96 1.0 7.691 3.0 18.6 390.77 6.58
181 0.06888 0.0 2.46 0.0 6.144 3.0 17.8 396.90 9.45
452 5.09017 0.0 18.10 0.0 6.297 24.0 20.2 385.09 17.27
381 15.87440 0.0 18.10 0.0 6.545 24.0 20.2 396.90 21.08
... ... ... ... ... ... ... ... ... ...
476 4.87141 0.0 18.10 0.0 6.484 24.0 20.2 396.21 18.68
322 0.35114 0.0 7.38 0.0 6.041 5.0 19.6 396.90 7.70
382 9.18702 0.0 18.10 0.0 5.536 24.0 20.2 396.90 23.60
365 4.55587 0.0 18.10 0.0 3.561 24.0 20.2 354.70 7.12
In [52]:
X_test
Out[52]: 0 1 2 3 5 8 10 11 12
410 51.13580 0.0 18.10 0.0 5.757 24.0 20.2 2.60 10.11
280 0.03578 20.0 3.33 0.0 7.820 5.0 14.9 387.31 3.76
422 12.04820 0.0 18.10 0.0 5.648 24.0 20.2 291.55 14.10
199 0.03150 95.0 1.47 0.0 6.975 3.0 17.0 396.90 4.56
... ... ... ... ... ... ... ... ... ...
310 2.63548 0.0 9.90 0.0 4.973 4.0 18.4 350.45 12.64
0 1 2 3 5 8 10 11 12
151 1.49632 0.0 19.58 0.0 5.404 5.0 14.7 341.60 13.28
426 12.24720 0.0 18.10 0.0 5.837 24.0 20.2 24.65 15.69
472 3.56868 0.0 18.10 0.0 6.437 24.0 20.2 393.37 14.36
In [53]:
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train, Y_train)
Y_pred = lm.predict(X_test)
In [54]:
from sklearn.metrics import r2_score
r3=r2_score(Y_test, Y_pred)
print("r2_score value is :",r3)
In [55]:
corr_feature
Out[55]: {4, 6, 7, 9}
In [56]:
predicted= lm.predict([[0.10659,80.0,1.91,0.0,5.936,4.0,22.0,376.04,5.57]])
print (predicted)
[22.85590786]
LOGISTIC REGRESSION
Aim:
To Write a python program for Implement Logistic Regression
Algorithm:
Logistic regression is one of the most popular Machine Learning algorithms, which comes under
the Supervised Learning technique. It is used for predicting the categorical dependent variable
using a given set of independent variables.
Logistic regression predicts the output of a categorical dependent variable. Therefore the
outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1, true or False,
etc. but instead of giving the exact value as 0 and 1, it gives the probabilistic values which lie
between 0 and 1.
Logistic Regression is much similar to the Linear Regression except that how they are used.
Linear Regression is used for solving Regression problems, whereas Logistic regression is used
for solving the classification problems.
In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic function,
which predicts two maximum values (0 or 1).
Logistic Regression is a significant machine learning algorithm because it has the ability to
provide probabilities and classify new data using continuous and discrete datasets.
Logistic Regression can be used to classify the observations using different types of data and
can easily determine the most effective variables used for the classification. The below image is
showing the logistic function:
In Logistic Regression y can be between 0 and 1 only, so for this let's divide the above equation
by (1-y):
But we need range between -[infinity] to +[infinity], then take logarithm of the equation it will
become:
Description
Steps to implementing the Machine Learning algorithm
Program
Import Libraries
In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.model_selection import train_test_split
In [3]:
print(db.shape)
print(db.head())
(768, 9)
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI \
0 6 148 72 35 0 33.6
localhost:8888/nbconvert/html/Desktop/m l record/Logistic regression.ipynb?download=false 2/5
12/9/22, 7:57 PM Logistic regression
1 1 85 66 29 0 26.6
2 8 183 64 0 0 23.3
3 1 89 66 23 94 28.1
4 0 137 40 35 168 43.1
In [4]:
x=db.iloc[:,0:8]
In [5]:
x
1 1 85 66 29 0 26.6 0.351
3 1 89 66 23 94 28.1 0.167
In [6]:
y=db.iloc[:,-1]
In [7]:
y
Out[7]: 0 1
1 0
2 1
3 0
4 1
..
763 0
764 0
765 0
766 1
767 0
Name: Outcome, Length: 768, dtype: int64
In [8]:
from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
x=sc.fit_transform(x)
x
In [13]:
import sklearn.metrics as metrics
cnf=metrics.confusion_matrix(y_true=y_test,y_pred=pred)
cnf
In [23]:
class_names=[0,1]
fig, ax=plt.subplots()
tick_marks=np.arange(len(class_names))
plt.xticks(tick_marks,class_names)
plt.yticks(tick_marks,class_names)
sns.heatmap(pd.DataFrame(cnf),annot=True,cmap="YlGnBu",fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix',y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
In [22]:
y_pred_proba=lr.predict_proba(x_test)[::,1]
fpr,tpr,_=metrics.roc_curve(y_test,y_pred_proba)
auc=metrics.roc_auc_score(y_test,y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()
CLUSTERING
In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
In [2]:
df = pd.read_csv('Mall_Customers.csv')
df
Out[2]: CustomerID Gender Age Annual Income (k$) Spending Score (1-100)
0 1 Male 19 15 39
1 2 Male 21 15 81
2 3 Female 20 16 6
3 4 Female 23 16 77
4 5 Female 31 17 40
In [4]:
X_train = df[['Age', 'Annual Income (k$)', 'Spending Score (1-100)']]
In [5]:
clustering = DBSCAN(eps=12.5, min_samples=4) .fit(X_train)
DBSCAN_dataset = X_train.copy()
DBSCAN_dataset.loc[:,'Cluster'] = clustering. labels_
In [6]:
DBSCAN_dataset.Cluster.value_counts().to_frame()
Out[6]: Cluster
0 112
2 34
3 24
-1 18
1 8
4 4
In [ ]:
outliers = DBSCAN_dataset[DBSCAN_dataset['Cluster']==-1]
fig2, (axes) = plt.subplots(1,2,figsize=(12,5))
sns.scatterplot( 'Annual Income (k$)', 'Spending Score (1-100)',
data-DBSCAN_dataset[DBSCAN_dataset[ 'Cluster']!=-1],
hue='Cluster', ax=axes[0], palette='Set2', legend='full', s=200)
AIM:- KNN
10.A) Write a Python program to implement KNN and SVM
description:
K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on Supervised
Learning technique. K-NN algorithm assumes the similarity between the new case/data and
available cases and put the new case into the category that is most similar to the available
categories. K-NN algorithm stores all the available data and classifies a new data point based on
the similarity. This means when new data appears then it can be easily classified into a well suite
category by using K- NN algorithm. K-NN algorithm can be used for Regression as well as for
Classification but mostly it is used for the Classification problems. K-NN is a non-parametric
algorithm, which means it does not make any assumption on underlying data. It is also called a
lazy learner algorithm because it does not learn from the training set immediately instead it
stores the dataset and at the time of classification, it performs an action on the dataset.
algorithm
The K-NN working can be explained on the basis of the below algorithm:
Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
Step-4: Among these k neighbors, count the number of the data points in each category.
Step-5: Assign the new data points to that category for which the number of the neighbor is
maximum.
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import linear_model
from sklearn.model_selection import train_test_split
iris = pd.read_csv("iris (4).csv")
print(iris.shape)
print(iris.head())
(150, 5)
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
localhost:8888/nbconvert/html/Desktop/m l record/KNN (1).ipynb?download=false 1/4
12/9/22, 7:58 PM KNN (1)
In [2]:
iris
In [3]:
print(iris.species)
0 setosa
1 setosa
2 setosa
3 setosa
4 setosa
...
145 virginica
146 virginica
147 virginica
148 virginica
149 virginica
Name: species, Length: 150, dtype: object
In [4]:
from sklearn import preprocessing
le= preprocessing.LabelEncoder()
le.fit(iris.species)
iris['species']=le.transform(iris.species)
print(iris)
print(iris.species)
1 0
2 0
3 0
4 0
..
145 2
146 2
147 2
148 2
149 2
Name: species, Length: 150, dtype: int32
In [8]:
X=iris.iloc[:,0:4]
y=iris.iloc[:,4]
y
Out[8]: 0 0
1 0
2 0
3 0
4 0
..
145 2
146 2
147 2
148 2
149 2
Name: species, Length: 150, dtype: int32
In [10]:
from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
sc=sc.fit(X)
X=sc.transform(X)
In [15]:
x_train,x_test,y_train,y_test=train_test_split(X,y,train_size=0.70,random_state=101)
In [20]:
from sklearn.neighbors import KNeighborsClassifier
ne=KNeighborsClassifier(n_neighbors=3)
ne.fit(x_train,y_train)
Out[20]: ▾ KNeighborsClassifier
KNeighborsClassifier(n_neighbors=3)
In [21]:
y_pred=ne.predict(x_test)
y_pred[:10]
In [22]:
import sklearn.metrics as metrics
confusion =metrics.confusion_matrix(y_true=y_test,y_pred=y_pred)
confusion
In [23]:
metrics.accuracy_score(y_true=y_test,y_pred=y_pred)
Out[23]: 0.9777777777777777
In [25]:
class_wise= metrics.classification_report(y_true=y_test,y_pred=y_pred)
print(class_wise)
accuracy 0.98 45
macro avg 0.98 0.97 0.98 45
weighted avg 0.98 0.98 0.98 45
SVM
DESCRIPTION:
Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms,
which is used for Classification as well as Regression problems. However, primarily, it is used for
Classification problems in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can segregate
n-dimensional space into classes so that we can easily put the new data point in the correct
category in the future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme
cases are called as support vectors, and hence algorithm is termed as Support Vector Machine.
Consider the below diagram in which there are two different categories that are classified using
a decision boundary or hyperplane:
ALGORITHM:-
Linear SVM:
The working of the SVM algorithm can be understood by using an example. Suppose we have a
dataset that has two tags (green and blue), and the dataset has two features x1 and x2. We want
a classifier that can classify the pair(x1, x2) of coordinates in either green or blue. Consider the
below image:
So as it is 2-d space so by just using a straight line, we can easily separate these two classes. But
there can be multiple lines that can separate these classes. Consider the below image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary
or region is called as a hyperplane. SVM algorithm finds the closest point of the lines from both
the classes. These points are called support vectors. The distance between the vectors and the
hyperplane is called as margin. And the goal of SVM is to maximize this margin. The hyperplane
with maximum margin is called the optimal hyperplane.
Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for non-linear
data, we cannot draw a single straight line. Consider the below image:
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import linear_model
from sklearn.model_selection import train_test_split
In [3]:
iris=pd.read_csv("iris.csv")
In [4]:
print(iris.shape)
print(iris.head())
(150, 5)
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
In [5]:
iris
In [9]:
print(iris.species)
0 setosa
1 setosa
2 setosa
3 setosa
4 setosa
...
145 virginica
146 virginica
147 virginica
148 virginica
149 virginica
Name: species, Length: 150, dtype: object
In [10]:
iris.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
# Column Non-Null Count Dtype
In [11]:
from sklearn import preprocessing
le=preprocessing.LabelEncoder()
le.fit(iris.species)
iris['species']=le.transform(iris.species)
print(iris)
print(iris.species)
In [12]:
iris.describe()
In [13]:
X=iris.iloc[:,0:4]
y=iris.iloc[:,4]
In [14]:
sc=preprocessing.StandardScaler()
sc=sc.fit(X)
X=sc.transform(X)
In [15]:
x_train,x_test,y_train,y_test=train_test_split(X,y,train_size=0.70,random_state=101)
print(x_test.shape)
print(x_train.shape)
print(y_test.shape)
print(y_train.shape)
(45, 4)
(105, 4)
(45,)
(105,)
In [16]:
from sklearn.svm import SVC
svc_model=SVC(C=.1,kernel='linear',gamma=1)
svc_model.fit(x_train,y_train)
In [17]:
y_pred=svc_model.predict(x_test)
In [18]:
y_pred[:10]
In [19]:
print(y_test[:11])
33 0
16 0
43 0
129 2
50 1
123 2
68 1
53 1
146 2
1 0
147 2
Name: species, dtype: int32
In [21]:
import sklearn.metrics as metrics
confusion=metrics.confusion_matrix(y_true=y_test,y_pred=y_pred)
confusion
In [22]:
print(svc_model.score(x_train,y_train))
print(svc_model.score(x_test,y_test))
0.9714285714285714
0.9777777777777777
In [23]:
class_wise=metrics.classification_report(y_true=y_test,y_pred=y_pred)
print(class_wise)
accuracy 0.98 45
macro avg 0.97 0.98 0.98 45
weighted avg 0.98 0.98 0.98 45
In [25]:
from sklearn.svm import SVC
svc_model=SVC(kernel='rbf')
svc_model.fit(x_train,y_train)
Out[25]: SVC()
In [27]:
import sklearn.metrics as metrics
confusion=metrics.confusion_matrix(y_true=y_test,y_pred=y_pred)
confusion
In [28]:
print(svc_model.score(x_train,y_train))
print(svc_model.score(x_test,y_test))
0.9809523809523809
0.9777777777777777
In [29]:
class_wise=metrics.classification_report(y_true=y_test,y_pred=y_pred)
print(class_wise)
accuracy 0.98 45
macro avg 0.97 0.98 0.98 45
weighted avg 0.98 0.98 0.98 45
In [30]:
from sklearn.svm import SVC
svc_model=SVC(kernel='poly')
svc_model.fit(x_train,y_train)
Out[30]: SVC(kernel='poly')
In [31]:
import sklearn.metrics as metrics
confusion=metrics.confusion_matrix(y_true=y_test,y_pred=y_pred)
confusion
In [32]:
print(svc_model.score(x_train,y_train))
print(svc_model.score(x_test,y_test))
0.9333333333333333
0.9111111111111111
In [33]:
class_wise=metrics.classification_report(y_true=y_test,y_pred=y_pred)
print(class_wise)
accuracy 0.98 45
macro avg 0.97 0.98 0.98 45
weighted avg 0.98 0.98 0.98 45
EXPERIMENT-11
Aim:Naive bayes
Naïve Bayes Classifier Algorithm
Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes theorem and used for solving
classification problems. It is mainly used in text classification that includes a high-dimensional training dataset.
Naïve Bayes Classifier is one of the simple and most effective Classification algorithms which helps in building
the fast machine learning models that can make quick predictions. It is a probabilistic classifier, which means it
predicts on the basis of the probability of an object. Some popular examples of Naïve Bayes Algorithm are spam
filtration, Sentimental analysis, and classifying articles.
Steps
1.Understand the business problem 2.Import the library files 3.Load the dataset 4.Data preprocessing 5.Split the
data into train and test 6.Build the model (Naïve Bayes classifier) 7.Test the model 8.Performance Measures
9.Predict the class label for new data.
2.Importing libraries
In [9]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import linear_model
from sklearn.model_selection import train_test_split
import gc
import os
os.getcwd()
from sklearn.preprocessing import LabelBinarizer
In [11]:
print (weather.shape)
print (weather.head())
(14, 5)
Outlook Temperature Humidity Wind Class
0 Sunny Hot High Weak No
1 Sunny Hot High Strong No
2 Overcast Hot High Weak Yes
3 Rain Mild High Weak Yes
4 Rain Cool Normal Weak Yes
In [12]:
weather
4.Data preprocessing
In [13]:
weather.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14 entries, 0 to 13
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Outlook 14 non-null object
1 Temperature 14 non-null object
2 Humidity 14 non-null object
3 Wind 14 non-null object
4 Class 14 non-null object
dtypes: object(5)
memory usage: 688.0+ bytes
In [14]:
weather.describe()
count 14 14 14 14 14
unique 3 3 2 2 2
freq 5 6 7 8 9
In [20]:
X1=weather.iloc[:,0:4]
In [21]:
X1
In [22]:
y=weather.iloc[:,4]
In [23]:
y
Out[23]: 0 No
1 No
2 Yes
3 Yes
4 Yes
5 No
6 Yes
7 No
8 Yes
9 Yes
10 Yes
11 Yes
12 Yes
13 No
Name: Class , dtype: object
In [25]:
from sklearn.preprocessing import LabelEncoder
Numerics=LabelEncoder()
X1['outlook_n']=Numerics.fit_transform (X1['Outlook'])
X1['Temp_n']=Numerics.fit_transform (X1['Temperature'])
X1[ 'Humidity_n']=Numerics.fit_transform(X1[ 'Humidity'])
X1['windy_n']=Numerics.fit_transform(X1[ 'Wind'])
X1
In [26]:
X1=X1.drop(['Outlook', 'Temperature', 'Humidity', 'Wind'], axis='columns')
X1
0 2 1 0 1
1 2 1 0 0
2 0 1 0 1
3 1 2 0 1
4 1 0 1 1
5 1 0 1 0
6 0 0 1 0
7 2 2 0 1
8 2 0 1 1
9 1 2 1 1
10 2 2 1 0
11 0 2 0 0
12 0 1 1 1
13 1 2 0 0
In [28]:
x_train
5 1 0 1 0
0 2 1 0 1
4 1 0 1 1
8 2 0 1 1
9 1 2 1 1
7 2 2 0 1
6 0 0 1 0
1 2 1 0 0
11 0 2 0 0
In [29]:
x_test
12 0 1 1 1
2 0 1 0 1
3 1 2 0 1
13 1 2 0 0
10 2 2 1 0
In [30]:
print(x_test.shape)
print(x_train.shape)
print(y_test.shape)
print(y_train.shape)
(5, 4)
(9, 4)
(5,)
(9,)
In [33]:
x_test
12 0 1 1 1
2 0 1 0 1
3 1 2 0 1
13 1 2 0 0
10 2 2 1 0
8 .Performance Measures
In [34]:
import sklearn.metrics as metrics
In [36]:
confusion= metrics.confusion_matrix (y_true =y_test, y_pred =predictions)
confusion
In [38]:
metrics.accuracy_score (y_true=y_test, y_pred=predictions)
Out[38]: 0.8
In [40]:
class_wise =metrics.classification_report (y_true=y_test, y_pred=predictions)
print(class_wise)
accuracy 0.80 5
macro avg 0.75 0.88 0.76 5
weighted avg 0.90 0.80 0.82 5
['Yes']
RANDOM FOREST
1. Importing the library files 2. Reading the Iris Dataset 3. Preprocessing 4. Split the dataset into training and
testing 5. Build the model (Random Forest Model) 6. Evaluate the performance of the Model
In [3]:
import numpy as пр
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn import tree
from sklearn.model_selection import train_test_split
In [4]:
iris = pd.read_csv("iris.csv")
In [5]:
print (iris. shape)
print(iris.head())
(150, 5)
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
In [6]:
iris
In [7]:
print(iris.species)
0 setosa
1 setosa
2 setosa
3 setosa
4 setosa
...
145 virginica
146 virginica
147 virginica
148 virginica
149 virginica
Name: species, Length: 150, dtype: object
In [8]:
iris.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 sepal_length 150 non-null float64
1 sepal_width 150 non-null float64
2 petal_length 150 non-null float64
3 petal_width 150 non-null float64
4 species 150 non-null object
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
In [13]:
g=sns.pairplot(iris,hue='species')
In [14]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(iris.species)
iris['species']=le.transform(iris.species)
print(iris)
print(iris.species)
In [21]:
X=iris.iloc[:,0:4]
In [22]:
X
In [23]:
y=iris.iloc[:,4:]
In [24]:
y
Out[24]: species
0 0
1 0
2 0
3 0
4 0
... ...
145 2
146 2
147 2
148 2
149 2
In [28]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.70, random_st
In [29]:
print (X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
(105, 4)
(105, 1)
(45, 4)
(45, 1)
A random forest is a meta estimator that fits a number of decision tree classifiers on various
sub-samples of the dataset and uses averaging to improve the predictive accuracy and control
Parameters:
max_depthint, default=None The maximum depth of the tree. If None, then nodes are expanded
until all leaves are pure or until all leaves contain less than min_samples_split samples.
° If float, then min_samples split is a fraction and ceil(min_samples split * n_samples) are the
minimum number of samples for each split. min_samples_leafint or float, default=1 The
minimum number of samples required to be at a leaf node. A split point at any depth will only
be considered if it leaves at least min_samples_leaf training samples in each of the left and right
branches. This may have the effect of smoothing the model, especially in regression.
¢ If float, then min_samples leaf is a fraction and ceil(min_samples leaf * n_samples) are the
minimum num- ber of samples for each node.
max_features{“sqrt”, “Log2”, None}, int or float, default="s qrt” The number of features to
consider when looking for the best split:
ons float, then max_features is a fraction and max(1, int(max_features * n_featuresin)) features
are considered at each split. min_samples_leafint or float, default=1 The minimum number of
samples required to be at a leaf node. A split point at any depth will only be considered if it
leaves at least min_samples_leaf training samples in each of the left and right branches. This may
have the effect of smoothing the model, especially in regression.
¢ If float, then min_samples leaf is a fraction and ceil(min_samples leaf * n_samples) are the
minimum num- ber of samples for each node.
max_features{“sqrt”, “Log2”, None}, int or float, default="s qrt” The number of features to
consider when looking for the best split:
ons float, then max_features is a fraction and max(1, int(max_features * n_featuresin)) features
are considered at each split.
In [30]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=3,max_depth=2, random_state=42)
In [33]:
# Fit RandomForestCLassifier
rfc.fit(X_train, y_train)
In [37]:
from sklearn import tree
features = X.columns.values
classes = ['setosa', 'versicolor', 'virginica']
for estimator in rfc.estimators_:
print(estimator)
plt.figure(figsize=(12,6))
tree.plot_tree(estimator,
feature_names=features,
class_names=classes,
fontsize=8,
filled=True,
rounded=True)
plt.show()
DecisionTreeClassifier(max_depth=2, max_features='auto',
random_state=1608637542)
DecisionTreeClassifier(max_depth=2, max_features='auto',
random_state=1273642419)
DecisionTreeClassifier(max_depth=2, max_features='auto',
random_state=1935803228)
In [39]:
from sklearn.metrics import classification_report, confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)
sns.heatmap(cm, annot=True, fmt='d').set_title( 'confusion matrix (0 = setosa, 1 = v
print(classification_report(y_test,y_pred))
[[13 0 0]
[ 0 19 1]
[ 0 2 10]]
precision recall f1-score support
accuracy 0.93 45
macro avg 0.94 0.93 0.93 45
weighted avg 0.93 0.93 0.93 45
In [40]:
print(rfc.score(X_train, y_train))
print(rfc.score(X_test, y_test))
0.9714285714285714
0.9333333333333333
In [ ]:
# Organizing feature names and importances in a DataFrame
features_df = pd.DataFrame({'features': rfc.feature_names_in_, 'importances': rfc.fe
In [ ]: