MLC Practical
EXPERIMENT NO: 1
Open Command Prompt.
Type python --version and press Enter. You should see the version of Python
that you installed.
Basic Syntax
Operators
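The worked examples under these headings are not reproduced in this copy; the short snippet below is an assumed illustration (not from the original) of the kind of basic syntax and operators covered: variable assignment, printing, arithmetic, comparison and logical operators.

# basic syntax: variables and printing
a = 7
b = 3
print(f"a = {a}, b = {b}")
# arithmetic operators
print(a + b, a - b, a * b, a / b, a // b, a % b, a ** b)
# comparison and logical operators
print(a > b, a == b, (a > 0) and (b > 0), not (a < b))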
EXPERIMENT NO: 2
PANDAS
Pandas is a popular Python library for data analysis. It is not directly related
to Machine Learning, but since a dataset must be prepared before training,
Pandas comes in handy: it was developed specifically for data extraction and
preparation. It provides high-level data structures and a wide variety of tools
for data analysis, including many built-in methods for grouping, combining and
filtering data.
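As a quick, made-up illustration of the filtering and grouping methods mentioned above (column names and values are invented for this sketch):

import pandas as pd
data = pd.DataFrame({'city': ['Pune', 'Pune', 'Mumbai'], 'price': [100, 120, 200]})
print(data[data.price > 110])          # filtering rows by a condition
print(data.groupby('city').mean())     # grouping and aggregating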
NUMPY
NumPy is a very popular python library for large multi-dimensional array and
matrix processing, with the help of a large collection of high-level
mathematical functions. It is very useful for fundamental scientific
computations in Machine Learning. It is particularly useful for linear algebra,
Fourier transform, and random number capabilities. High-end libraries like
TensorFlow use NumPy internally for manipulation of Tensors.
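A small sketch of these capabilities (matrix processing, linear algebra, Fourier transform and random numbers):

import numpy as np
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])
print(a @ b)                       # matrix multiplication (linear algebra)
print(np.fft.fft([1, 0, 1, 0]))    # discrete Fourier transform
print(np.random.rand(2, 2))        # random numbers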
MATPLOTLIB
Matplotlib is a very popular Python library for data visualisation. Like Pandas,
it is not directly related to Machine Learning. It particularly comes in handy
when a programmer wants to visualise the patterns in the data. It is a 2D
plotting library used for creating 2D graphs and plots. A module named pyplot
makes it easy for programmers for plotting as it provides features to control
line styles, font properties, formatting axes, etc. It provides various kinds of
graphs and plots for data visualisation, viz., histogram, error charts, bar
charts, etc,
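A minimal pyplot sketch of the kind of 2D plot described above (values invented for illustration):

import matplotlib.pyplot as plt
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
plt.plot(x, y, linestyle='--', marker='o')   # line style and marker control via pyplot
plt.xlabel('x')
plt.ylabel('x squared')
plt.title('Simple 2D plot')
plt.show()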
SCIPY
SciPy is a very popular library among Machine Learning enthusiasts as it
contains different modules for optimization, linear algebra, integration and
statistics. There is a difference between the SciPy library and the SciPy stack.
The SciPy library is one of the core packages that make up the SciPy stack. SciPy is
also very useful for image manipulation.
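A small sketch touching the optimization and integration modules mentioned above:

from scipy import optimize, integrate
# minimise a simple quadratic function
result = optimize.minimize(lambda w: (w - 3) ** 2, x0=0.0)
print(result.x)               # approximately [3.]
# integrate x**2 from 0 to 1
area, err = integrate.quad(lambda t: t ** 2, 0, 1)
print(area)                   # approximately 0.333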
SCIKIT-LEARN
Scikit-learn is one of the most popular ML libraries for classical ML
algorithms. It is built on top of two basic Python libraries, viz., NumPy and
SciPy. Scikit-learn supports most of the supervised and unsupervised
learning algorithms. Scikit-learn can also be used for data-mining and
data-analysis, which makes it a great tool for those starting out with ML.
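A minimal sketch of a classical supervised-learning workflow with scikit-learn (the choice of classifier here is illustrative, not from the original text):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
x, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)
clf = DecisionTreeClassifier().fit(x_train, y_train)
print(clf.score(x_test, y_test))   # accuracy on the held-out data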
EXPERIMENT NO: 3
Theory:
In data analysis, missing data is a frequent issue that can hinder the accuracy
of models and insights. Handling this data properly is essential to ensure data
quality and reliable results. In Python, libraries like Pandas provide powerful
tools to detect and manage missing values.
By applying these techniques, analysts can ensure the dataset is complete and
suitable for further analysis, improving the overall accuracy and performance
of models.
Program:
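The code cells of this program are only partly visible in this copy. A minimal sketch of what they likely contained, consistent with the outputs reproduced below (the random values themselves will differ on each run):

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])   # rows b, d, g become NaN
print(df)
print(df['one'].isnull())    # True where a value is missing
print(df['one'].notnull())   # True where a value is present
print(df['one'].sum())       # sum() skips NaN values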
c -0.595357 -0.062141 -0.679225
d NaN NaN NaN
e 0.910208 -1.230797 0.191110
f -0.062459 0.092898 1.320681
g NaN NaN NaN
h 1.313131 -0.963366 0.358444
a False
b True
c False
d True
e False
f False
g True
h False
Name: one, dtype: bool
a True
b False
c True
d False
e True
f True
g False
h True
Name: one, dtype: bool
2.6321528307002513
[5] import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3), index=[1, 2, 3, 4, 5],
                  columns=['one', 'two', 'three'])
# the new labels do not match the original index, so every value becomes NaN;
# sum() skips NaN values, so an all-NaN column sums to 0.0
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df['one'].sum())
0.0
EXPERIMENT NO: 4
Theory:
Linear regression is a fundamental statistical method used to model the
relationship between a dependent variable (target) and one or more
independent variables (predictors). It is one of the most basic and widely used
forms of predictive modeling in machine learning and statistics.
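The first cell of this program, containing the imports, is not visible in this copy; the cells below presumably relied on something like:

import pandas as pd
import matplotlib.pyplot as plt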
[2] df = pd.read_csv('homeprice.csv')
[3] df
Area Price
0 1000 230000
1 1300 270000
2 3000 620000
3 2600 570000
4 3200 660000
5 2100 510000
[4] plt.xlabel('Area')
plt.ylabel('Price')
plt.scatter(df.Area, df.Price, color='red', marker='+')
<matplotlib.collections.PathCollection at 0xec17800>
[5] x = df.iloc[:, 0].values.reshape(-1, 1)
x
array([[1000],
[1300],
[3000],
[2600],
[3200],
[2100]], dtype=int64)
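The cells that extract the target column and fit the model are not shown in this copy; presumably something along these lines, which matches the two outputs that follow (the price array and the LinearRegression() repr):

from sklearn.linear_model import LinearRegression
y = df.iloc[:, 1].values.reshape(-1, 1)   # Price column as the target
reg = LinearRegression()
reg.fit(x, y)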
array([[230000],
[270000],
[620000],
[570000],
[660000],
[510000]], dtype=int64)
LinearRegression()
Predict price of home with area = 3300 sq.ft.
[8] reg.predict([[3300]])
array([[697208.53858785]])
[9] reg.coef_
array([[200.49261084]])
[10] reg.intercept_
array([35582.9228243])
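The value printed next is simply the fitted line evaluated by hand, price = coef * area + intercept; the missing cell presumably did something like:

reg.coef_[0][0] * 3300 + reg.intercept_[0]   # 200.49261084 * 3300 + 35582.9228243 ≈ 697208.54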
697208.5385963024
[12] reg.predict([[5000]])
array([[1038045.97701149]])
Regression Line
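The cell computing the fitted values used in the plot below is not visible; presumably:

y_pred = reg.predict(x)   # predicted prices for the training areas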
[14] plt.scatter(x, y)
plt.plot(x, y_pred,color='red')
plt.show()
Generate CSV file with list of home price predictions.
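The cell creating area_df is not shown; it presumably read a CSV of areas, for example (file name assumed):

area_df = pd.read_csv('areas.csv')   # hypothetical file containing the Area column shown below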
[16] area_df
Area
0 1100
1 1600
2 2000
3 2200
4 2400
5 2800
6 3400
7 4000
[17] p = reg.predict(area_df)
C:\Users\admin\anaconda3\Lib\site-packages\sklearn\base.py:486: UserWarning: X
has feature names, but LinearRegression was fitted without feature names
warnings.warn
[18] p
array([[256124.79474548],
[356371.1001642 ],
[436568.14449918],
[476666.66666667],
[516765.18883415],
[596962.23316913],
[717257.79967159],
[837553.36617406]])
[19] area_df['price']=p
[20] area_df
Area price
0 1100 256124.794745
1 1600 356371.100164
2 2000 436568.144499
3 2200 476666.666667
4 2400 516765.188834
5 2800 596962.233169
6 3400 717257.799672
7 4000 837553.366174
area_df.to_csv("prediction.csv")
EXPERIMENT NO: 5
Theory:
In Simple Linear Regression, a single independent/predictor variable (X) is
used to model the response variable (Y). However, there may be various cases in
which the response variable is affected by more than one predictor variable;
for such cases, the Multiple Linear Regression algorithm is used. Multiple
Linear Regression is an extension of Simple Linear Regression, as it takes more
than one predictor variable to predict the response variable. We can define it
as:
Definition: Multiple Linear Regression is one of the important regression
algorithms which models the linear relationship between a single dependent
continuous variable and more than one independent variable.
Example:
Prediction of CO2 emission based on engine size and number of cylinders in a
car.
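The program pages for this experiment are not reproduced in this copy. The sketch below, using a small made-up dataset and assumed column names, only illustrates how multiple linear regression can be fitted for the CO2-emission example:

import pandas as pd
from sklearn.linear_model import LinearRegression
# hypothetical dataset: engine size (litres), number of cylinders, CO2 emission (g/km)
df = pd.DataFrame({
    'EngineSize':  [2.0, 2.4, 1.5, 3.5, 3.6],
    'Cylinders':   [4, 4, 4, 6, 6],
    'CO2Emission': [196, 221, 136, 255, 244],
})
reg = LinearRegression()
reg.fit(df[['EngineSize', 'Cylinders']], df.CO2Emission)
print(reg.coef_, reg.intercept_)
# predicted CO2 emission for a 3.0 L, 6-cylinder engine
print(reg.predict(pd.DataFrame({'EngineSize': [3.0], 'Cylinders': [6]})))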
EXPERIMENT NO: 6
Theory:
Logistic Regression
Logistic regression uses the sigmoid function, which maps any real value to a
value within the range of 0 and 1. The output of logistic regression must lie
between 0 and 1 and cannot go beyond this limit, so it forms a curve like the
"S" form.
The S-form curve is called the Sigmoid function or the logistic function.
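For reference, the sigmoid (logistic) function underlying this curve is

sigmoid(z) = 1 / (1 + e^(-z))

which maps any real-valued input z to an output strictly between 0 and 1.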
On the basis of the categories, Logistic Regression can be classified into three
types:
Binomial: In binomial Logistic regression, there can be only two possible types
of the dependent variables, such as 0 or 1, Pass or Fail, etc.
Multinomial: In multinomial Logistic regression, there can be 3 or more
possible unordered types of the dependent variable, such as "cat", "dogs", or
"sheep".
Ordinal: In ordinal Logistic regression, there can be 3 or more possible
ordered types of the dependent variable, such as "low", "medium", or "high".
Program:
In [ ]: import pandas as pd
import numpy as np
from google.colab import files   # needed for files.upload() below (Colab environment)
uploaded = files.upload()
In [ ]: df = pd.read_csv('insurance_data.csv')
df
Out[ ]:
In [ ]: import matplotlib.pyplot as plt
plt.scatter(df.Age,df.Bought_Insurance,marker='+',color='red')
Out[ ]:
<matplotlib.collections.PathCollection at 0x7ab58c242fb0>
In [ ]: from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test =
train_test_split(df[['Age']],df.Bought_Insurance,train_size=0.8)
In [ ]: x_test
Out[ ]:
In [ ]: x_train
Out[ ]:
In [ ]: from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(x_train,y_train)
Out[ ]: LogisticRegression()
In [ ]: y_predicted = model.predict(x_test)
In [14]: model.predict_proba(x_test)
Out[14]:
array([[0.04246871, 0.95753129],
[0.19494106, 0.80505894],
[0.04246871, 0.95753129],
[0.97525075, 0.02474925],
[0.96265717, 0.03734283],
[0.13674865, 0.86325135]])
In [16]: from sklearn.metrics import confusion_matrix, classification_report
confusion_matrix(y_test,y_predicted)
Out[16]:
array([[1, 1],
[1, 3]])
print(classification_report(y_test,y_predicted))
Classification Report:
accuracy    0.67    (support: 6)
EXPERIMENT NO: 7
Theory:
Logistic regression can be extended to handle multiclass classification
problems using several approaches. Unlike binary logistic regression, which
deals with two classes, multiclass classification involves more than two
classes.
Program:
In [ ]: from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
digits = load_digits()
In [ ]: digits.data.shape
for i in range(5):
    plt.matshow(digits.images[i])
In [ ]: dir(digits)
In [ ]: digits.DESCR[0]
Out[ ]: '.'
In [ ]: digits.data[0]
Out[ ]: array([ 0., 0., 5., 13., 9., 1., 0., 0., 0., 0., 13., 15., 10.,
15., 5., 0., 0., 3., 15., 2., 0., 11., 8., 0., 0., 4.,
12., 0., 0., 8., 8., 0., 0., 5., 8., 0., 0., 9., 8.,
0., 0., 4., 11., 0., 1., 12., 7., 0., 0., 2., 14., 5.,
10., 12., 0., 0., 0., 0., 6., 13., 10., 0., 0., 0.])
In [ ]: digits.target[0]
Out[ ]: 0
In [ ]: digits.target_names[0]
Out[ ]: 0
Create and train logistic regression model.
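The cell splitting the digits data into train and test sets is not visible in this copy; presumably something like (split fraction assumed):

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(digits.data, digits.target, test_size=0.2)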
In [ ]: from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
In [ ]: model.fit(x_train, y_train)
/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py:460:
ConvergenceWarning: lbfgs failed to converge (status=1):
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
Out[ ]: LogisticRegression()
In [ ]: x_test
...,
Out[ ]: 0.9638888888888889
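The accuracy value just above presumably comes from scoring the fitted model on the held-out digits; a sketch of such a cell:

In [ ]: model.score(x_test, y_test)   # mean accuracy on the test set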
In [ ]: model.predict(digits.data[0:10])
In [ ]: from sklearn.metrics import confusion_matrix
y_pred = model.predict(x_test)
confusion_matrix(y_test, y_pred)
[ 0, 38, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 1, 35, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 36, 0, 1, 0, 1, 2, 0],
[ 0, 0, 0, 0, 29, 0, 0, 0, 0, 0],
[ 0, 1, 0, 0, 0, 30, 0, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0, 36, 0, 0, 0],
[ 0, 1, 0, 0, 0, 0, 0, 31, 0, 1],
[ 0, 0, 0, 0, 0, 0, 0, 0, 39, 0],
[ 0, 1, 0, 0, 0, 1, 0, 0, 2, 34]])
In [ ]: from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
Classification Report:
EXPERIMENT NO: 8
Theory:
Bayes' Theorem
The foundation of Naive Bayes lies in Bayes' Theorem, which helps calculate
the probability of a hypothesis (label) given some evidence (features). The
formula for Bayes' Theorem is:
P(y | X) = P(X | y) * P(y) / P(X)
Where:
P(y | X) is the posterior probability of the label y given the features X,
P(X | y) is the likelihood of the features given the label,
P(y) is the prior probability of the label, and
P(X) is the probability of the evidence (the features).
Naive Bayes additionally assumes that the features are conditionally
independent of one another given the label. This independence assumption is
rarely true in practice, but Naive Bayes still works well in many cases due to
its simplicity and efficiency.
1. Gaussian Naive Bayes: Suitable for continuous data; it assumes that the
values of each feature follow a normal (Gaussian) distribution.
2. Multinomial Naive Bayes: Suitable for discrete count data, often used
for word counts in text classification.
3. Bernoulli Naive Bayes: Suitable for binary/boolean data, often used
when the features are represented as binary values (e.g., the presence or
absence of a word in text classification).
Applications
Naive Bayes is commonly used in text classification tasks such as spam
filtering and sentiment analysis, and in other problems where fast,
probabilistic predictions are needed.
Program:
In [4]: from sklearn import datasets
from sklearn.naive_bayes import GaussianNB, MultinomialNB
iris = datasets.load_iris()
x = iris.data
y = iris.target
gnb = GaussianNB()
mnb = MultinomialNB()
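# The cells that split the data, fit the Gaussian model and build the confusion
# matrix are not visible in this copy; the lines below are an assumed sketch that
# defines the y_pred_gnb and cnf_matrix_gnb names used in the output that follows.
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)
y_pred_gnb = gnb.fit(x_train, y_train).predict(x_test)
cnf_matrix_gnb = confusion_matrix(y_test, y_pred_gnb)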
cnf_matrix_gnb
[ 0, 18, 0],
[ 0, 0, 11]])
In [5]: from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred_gnb))
ans
Out[6]: array([1])
In [7]: from sklearn.datasets import load_iris
import pandas as pd
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target
X = iris.data
df.sample(4)
In [9]: df['species'] = pd.Categorical.from_codes(iris.target,
iris.target_names)
df.head()
df.tail()
EXPERIMENT NO: 9
Theory:
In some cases, K is not clearly defined, and we have to think about the optimal
value of K. K-Means clustering performs best when the data is well separated;
when data points overlap, this clustering is not suitable. K-Means is faster
compared to other clustering techniques and provides strong coupling between
the data points. However, K-Means does not provide clear information regarding
the quality of the clusters, and different initial assignments of the cluster
centroids may lead to different clusters. The K-Means algorithm is also
sensitive to noise and may get stuck in local minima.
The goal of clustering is to divide the population or set of data points into a
number of groups so that the data points within each group are more
comparable to one another and different from the data points within the other
groups. It is essentially a grouping of things based on how similar and
different they are to one another.
The algorithm categorizes the items into k groups or clusters of similarity. To
calculate that similarity, we use the Euclidean distance as a measurement.
The algorithm works as follows:
1. Choose k initial centroids (for example, k randomly selected points).
2. Assign each data point to its nearest centroid.
3. Recompute each centroid as the mean of the points assigned to it.
4. Repeat steps 2 and 3 until the assignments no longer change (or a maximum
number of iterations is reached).
Program:
In [ ]: import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
x = np.array([[5,3], [10,15], [15,12], [24,10], [30,45],
[85,70], [71,80], [60,78], [55,52], [80,91]])
kmeans = KMeans(n_clusters=2, random_state=0).fit(x)
print(kmeans.labels_)
print(kmeans.cluster_centers_)
plt.scatter(x[:,0], x[:,1], label = 'trueposition')
Out[ ]:
<matplotlib.collections.PathCollection at 0x78428cb42080>
In [ ]: kmeans = KMeans(n_clusters=2)
kmeans.fit(x)
print(kmeans.cluster_centers_)
[[70.2 74.2]
[16.8 17. ]]
In [ ]: print(kmeans.labels_)
Out[ ]: [1 1 1 1 1 0 0 0 0 0]
In [2]: import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn import datasets
iris = datasets.load_iris()
x = iris.data
plt.scatter(x[:,0], x[:,1], label = 'TruePosition')
Out[2]:
<matplotlib.collections.PathCollection at 0x7b286e9dd720>
In [3]: kmeans = KMeans(n_clusters=2)
kmeans.fit(x)
print(kmeans.cluster_centers_)
Out[ ]: [[6.30103093 2.88659794 4.95876289 1.69587629]
[5.00566038 3.36981132 1.56037736 0.29056604]]
In [5]: print(kmeans.labels_)
Out[ ]:
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
In [7]: kmeans = KMeans(n_clusters=3)
kmeans.fit(x)
print(kmeans.cluster_centers_)
Out[ ]: [[6.85384615 3.07692308 5.71538462 2.05384615]
[5.006 3.428 1.462 0.246 ]
[5.88360656 2.74098361 4.38852459 1.43442623]]
In [8]: print(kmeans.labels_)
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 0 2 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 0 2 0 0 0 0 2 0 0 0 0
 0 0 2 2 0 0 0 0 2 0 2 0 2 0 0 2 2 0 0 0 0 0 2 0 0 0 0 2 0 0 0 2 0 0 0 2 0 0 2]
EXPERIMENT NO: 10
Theory:
As the number of features or dimensions in a dataset increases, the amount of
data required to obtain a statistically significant result increases
exponentially. This can lead to issues such as overfitting, increased
computation time, and reduced accuracy of machine learning models. This is
known as the curse of dimensionality, a set of problems that arise while
working with high-dimensional data. Dimensionality-reduction techniques such as
Principal Component Analysis (PCA), used in the program below, address this by
projecting the data onto a smaller number of components.
Program:
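The initial cells of this program are not visible in this copy. A sketch of the likely first steps (the two-variable dataset and its mean-centred version), consistent with the centred values printed below and with the raw values that reappear in the final output of this experiment; the variable name meanAdjusted is assumed:

import numpy as np
x = np.array([2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1])
y = np.array([2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9])
data = np.array([x, y])                                  # shape (2, 10)
meanAdjusted = data - data.mean(axis=1, keepdims=True)   # subtract the mean of each row
print(meanAdjusted)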
print(data)
In [4]: data.shape
Out [5]:[[ 0.69 -1.31 0.39 0.09 1.29 0.49 0.19 -0.81 -0.31 -0.71]
[ 0.49 -1.21 0.99 0.29 1.09 0.79 -0.31 -0.81 -0.31 -1.01]]
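The cell computing the covariance matrix and its eigen-decomposition is not shown; a sketch consistent with the eigenvector and eigenvalue output below (variable names assumed):

covMatrix = np.cov(meanAdjusted)                  # 2x2 covariance matrix
eigenValues, eigenVectors = np.linalg.eig(covMatrix)
print('Eigenvectors\n', eigenVectors)
print('Eigenvalues\n', eigenValues)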
Out [7]: Eigenvectors
[[-0.73517866 -0.6778734 ]
[ 0.6778734 -0.73517866]]
Eigenvalues
[0.0490834 1.28402771]
Step 7: Retaining only those eigenvalues having maximum values, transform the data and display it.
In [13]: transformedData = [transformedData1, transformedData2]
transformedData = np.transpose(transformedData)
print(transformedData)
Out [13]:
[[-0.82797019 -0.17511531]
[ 1.77758033 0.14285723]
[-0.99219749 0.38437499]
[-0.27421042 0.13041721]
[-1.67580142 -0.20949846]
[-0.9129491 0.17528244]
[ 0.09910944 -0.3498247 ]
[ 1.14457216 0.04641726]
[ 0.43804614 0.01776463]
[ 1.22382056 -0.16267529]]
Out [16]:
[[2.5 2.4]
[0.5 0.7]
[2.2 2.9]
[1.9 2.2]
[3.1 3. ]
[2.3 2.7]
[2. 1.6]
[1. 1.1]
[1.5 1.6]
[1.1 0.9]]
EXPERIMENT NO: 11
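The setup cells for this experiment are not visible in this copy. The covariance matrix printed below is consistent with two perfectly anti-correlated sequences, so the earlier cells presumably looked something like:

In [ ]: import numpy as np
import matplotlib.pyplot as plt
x1 = np.arange(0, 10)
y1 = np.arange(0, 10)[::-1]   # y1 decreases as x1 increases, giving a negative covariance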
In [8]: plt.plot(x1,y1)
Out[8]:
[<matplotlib.lines.Line2D at 0x7f1efc951060>]
In [9]: np.cov([x1,y1])
Out[9]:
array([[ 9.16666667, -9.16666667],
[-9.16666667, 9.16666667]])
In [10]: x2 = np.arange(0,10)
y2 = np.array([2]*10)
plt.plot(x2,y2)
Out[10]:
[<matplotlib.lines.Line2D at 0x7f1efc7694b0>]
In [12]: x3 = np.array([2]*10)
y3 = np.arange(0,10)
plt.plot(x3,y3)
Out[12]:
[<matplotlib.lines.Line2D at 0x7f1efc5d7640>]
In [13]: np.cov([x3,y3])
Out[13]:
array([[0. , 0. ],
[0. , 9.16666667]])
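The cell loading the IRIS data used from here on is likewise not shown; presumably:

In [ ]: from sklearn import datasets
iris = datasets.load_iris()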
In [16]: X = iris.data
X.shape
Out[16]: (150, 4)
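The cell standardising the data and computing its covariance matrix is not visible; a sketch that matches the 4x4 matrix shown below (X_std is a name used again in cell [25]; cov_mat is assumed):

In [ ]: from sklearn.preprocessing import StandardScaler
X_std = StandardScaler().fit_transform(X)   # standardise each feature
cov_mat = np.cov(X_std.T)                   # covariance matrix of the standardised features
cov_mat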
Out[19]:
array([[ 1.00671141, -0.11835884, 0.87760447, 0.82343066],
[-0.11835884, 1.00671141, -0.43131554, -0.36858315],
[ 0.87760447, -0.43131554, 1.00671141, 0.96932762],
[ 0.82343066, -0.36858315, 0.96932762, 1.00671141]])
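The eigen-decomposition cell is also not shown; presumably (names eig_vals and eig_vecs assumed):

In [ ]: eig_vals, eig_vecs = np.linalg.eig(cov_mat)
print('Eigenvectors\n', eig_vecs)
print('Eigenvalues\n', eig_vals)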
Eigenvectors
[[ 0.52106591 -0.37741762 -0.71956635 0.26128628]
[-0.26934744 -0.92329566 0.24438178 -0.12350962]
[ 0.5804131 -0.02449161 0.14212637 -0.80144925]
[ 0.56485654 -0.06694199 0.63427274 0.52359713]]
Eigenvalues
[2.93808505 0.9201649 0.14774182 0.02085386]
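The cell constructing eig_pairs (used in the loop below) is not visible; presumably the standard pairing-and-sorting step, using the names from the sketch above:

In [ ]: eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:, i]) for i in range(len(eig_vals))]
eig_pairs.sort(key=lambda p: p[0], reverse=True)   # sort by eigenvalue, largest first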
print('Eigenvalues in descending order:')
for i in eig_pairs:
print(i[0])
Out[9]:
Eigenvalues in descending order:
2.938085050199995
0.9201649041624864
0.1477418210449475
0.020853862176462696
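The cell building the projection matrix matrix_w (used in cell [25] below) is not visible; presumably the top two eigenvectors stacked as columns:

In [ ]: matrix_w = np.hstack((eig_pairs[0][1].reshape(4, 1),
                              eig_pairs[1][1].reshape(4, 1)))
print('Matrix W:\n', matrix_w)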
Out[24]:
Matrix W:
[[ 0.52106591 -0.37741762]
[-0.26934744 -0.92329566]
[ 0.5804131 -0.02449161]
[ 0.56485654 -0.06694199]]
In [25]: Y = X_std.dot(matrix_w)
print (Y[0:5])
Out[25]:
[[-2.26470281 -0.4800266 ]
[-2.08096115 0.67413356]
[-2.36422905 0.34190802]
[-2.29938422 0.59739451]
[-2.38984217 -0.64683538]]
In [28]: import pylab as pl   # assumed: the cell importing pl (pylab) is not visible in this copy
pl.figure()
target_names = iris.target_names
y = iris.target
for c, i, target_name in zip("rgb", [0, 1, 2], target_names):
pl.scatter(Y[y==i,0], Y[y==i,1], c=c, label=target_name)
pl.xlabel('Principal Component 1')
pl.ylabel('Principal Component 2')
pl.legend()
pl.title('PCA of IRIS dataset')
pl.show()
Out[28]: (scatter plot titled 'PCA of IRIS dataset': Principal Component 1 vs Principal Component 2, one colour per species)