Chap5_wei.ipynb
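The notebook begins by loading the Iris dataset from scikit-learn. The loading cell itself was lost in the export, so this is a minimal reconstruction:

from sklearn import datasets
iris = datasets.load_iris()  # Bunch object holding data, target, DESCR, ...
print(iris)                  # dump the whole object (output truncated below)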
July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.

This is perhaps the best known database to be found in the pattern recognition
literature. Fisher's paper is a classic in the field and is referenced
frequently to this day. (See Duda & Hart, for example.) The data set contains
3 classes of 50 instances each, where each class refers to a type of iris
plant. One class is linearly separable from the other 2; the latter are NOT
linearly separable from each other.

**References**

- Fisher, R.A. "The use of multiple measurements in taxonomic problems"
  Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
  Mathematical Statistics" (John Wiley, NY, 1950).
- Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
  (Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
- Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
  Structure and Classification Rule for Recognition in Partially Exposed
  Environments". IEEE Transactions on Pattern Analysis and Machine
  Intelligence, Vol. PAMI-2, No. 1, 67-71.
- Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions
  on Information Theory, May 1972, 431-433.
- See also: 1988 MLC Proceedings, 54-64. Cheeseman et al's AUTOCLASS II
  conceptual clustering system finds 3 classes in the data.
- Many, many more ...
'feature_names': ['sepal length (cm)',
'sepal width (cm)',
'petal length (cm)',
'petal width (cm)'],
'filename': 'iris.csv',
'data_module': 'sklearn.datasets.data'}
print(iris.DESCR)
:Attribute Information:
- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm
- class:
        - Iris-Setosa
        - Iris-Versicolour
        - Iris-Virginica

:Summary Statistics:

============== ==== ==== ======= ===== ====================
                Min  Max   Mean    SD   Class Correlation
============== ==== ==== ======= ===== ====================
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
============== ==== ==== ======= ===== ====================
The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.
print(iris.data)           # features
print(iris.feature_names)  # feature names
...
[6.4 2.8 5.6 2.2]
[6.3 2.8 5.1 1.5]
[6.1 2.6 5.6 1.4]
[7.7 3. 6.1 2.3]
[6.3 3.4 5.6 2.4]
[6.4 3.1 5.5 1.8]
[6. 3. 4.8 1.8]
[6.9 3.1 5.4 2.1]
[6.7 3.1 5.6 2.4]
[6.9 3.1 5.1 2.3]
[5.8 2.7 5.1 1.9]
[6.8 3.2 5.9 2.3]
[6.7 3.3 5.7 2.5]
[6.7 3. 5.2 2.3]
[6.3 2.5 5. 1.9]
[6.5 3. 5.2 2. ]
[6.2 3.4 5.4 2.3]
[5.9 3. 5.1 1.8]]
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
print(iris.target) # Labels
print(iris.target_names)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2]
['setosa' 'versicolor' 'virginica']
import pandas as pd
df = pd.DataFrame(iris.data)
print(df.head())
0 1 2 3
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
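For readability, the feature names can be attached as column labels (an optional variation on the cell above, not in the original notebook):

df = pd.DataFrame(iris.data, columns=iris.feature_names)  # label the columns
print(df.head())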
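Next comes the Wisconsin breast cancer dataset. Its loading cell is missing from the export; a minimal reconstruction:

# data on breast cancer
breast_cancer = datasets.load_breast_cancer()
print(breast_cancer.DESCR)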
The actual linear program used to obtain the separating plane
in the 3-dimensional space is that described in:
[K. P. Bennett and O. L. Mangasarian: "Robust Linear
Programming Discrimination of Two Linearly Inseparable Sets",
Optimization Methods and Software 1, 1992, 23-34].
ftp ftp.cs.wisc.edu
cd math-prog/cpo-dataset/machine-learn/WDBC/
**References**
- W.N. Street, W.H. Wolberg and O.L. Mangasarian. Nuclear feature extraction
for breast tumor diagnosis. IS&T/SPIE 1993 International Symposium on
Electronic Imaging: Science and Technology, volume 1905, pages 861-870,
San Jose, CA, 1993.
- O.L. Mangasarian, W.N. Street and W.H. Wolberg. Breast cancer diagnosis and
prognosis via linear programming. Operations Research, 43(4), pages 570-577,
July-August 1995.
- W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Machine learning techniques
to diagnose breast cancer from fine-needle aspirates. Cancer Letters 77 (1994)
163-171.
import pandas as pd
df = pd.DataFrame(breast_cancer.data)
print(df.head())
0 1 2 3 4 5 6 7 8 \
0 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 0.2419
1 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 0.1812
2 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 0.2069
3 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 0.2597
4 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 0.1809
9 ... 20 21 22 23 24 25 26 27 \
0 0.07871 ... 25.38 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654
1 0.05667 ... 24.99 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860
2 0.05999 ... 23.57 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430
3 0.09744 ... 14.91 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575
4 0.05883 ... 22.54 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625
28 29
0 0.4601 0.11890
1 0.2750 0.08902
2 0.3613 0.08758
3 0.6638 0.17300
4 0.2364 0.07678
[5 rows x 30 columns]
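The diagnosis labels live in breast_cancer.target; the class names can be inspected as follows (an added illustrative line, not in the original notebook):

print(breast_cancer.target_names)  # the two diagnosis classes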
# data on diabetes
diabetes = datasets.load_diabetes()
print(diabetes.DESCR)
.. _diabetes_dataset:
Diabetes dataset
----------------
Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.
:Target: Column 11 is a quantitative measure of disease progression one year after baseline
:Attribute Information:
- age age in years
- sex
- bmi body mass index
- bp average blood pressure
- s1 tc, total serum cholesterol
- s2 ldl, low-density lipoproteins
- s3 hdl, high-density lipoproteins
- s4 tch, total cholesterol / HDL
- s5 ltg, possibly log of serum triglycerides level
- s6 glu, blood sugar level
Note: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times the square root of `n_samples` (i.e. the sum of squares of each column totals 1).
Source URL:
https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html
import pandas as pd
df = pd.DataFrame(diabetes.data)
print(df.head())
0 1 2 3 4 5 6 \
0 0.038076 0.050680 0.061696 0.021872 -0.044223 -0.034821 -0.043401
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163 0.074412
2 0.085299 0.050680 0.044451 -0.005670 -0.045599 -0.034194 -0.032356
3 -0.089063 -0.044642 -0.011595 -0.036656 0.012191 0.024991 -0.036038
4 0.005383 -0.044642 -0.036385 0.021872 0.003935 0.015596 0.008142
7 8 9
0 -0.002592 0.019907 -0.017646
1 -0.039493 -0.068332 -0.092204
2 -0.002592 0.002861 -0.025930
3 0.034309 0.022688 -0.009362
4 -0.002592 -0.031988 -0.046641
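The optical digits dataset is loaded the same way (the loading cell was dropped in the export; a minimal reconstruction):

digits = datasets.load_digits()
print(digits.DESCR)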
.. _digits_dataset:
This is a copy of the test set of the UCI ML hand-written digits datasets
https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits
import pandas as pd
df = pd.DataFrame(digits.data)
print(df.head())
0 1 2 3 4 5 6 7 8 9 ... 54 55 56 \
0 0.0 0.0 5.0 13.0 9.0 1.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0
1 0.0 0.0 0.0 12.0 13.0 5.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0
2 0.0 0.0 0.0 4.0 15.0 12.0 0.0 0.0 0.0 0.0 ... 5.0 0.0 0.0
3 0.0 0.0 7.0 15.0 13.0 1.0 0.0 0.0 0.0 8.0 ... 9.0 0.0 0.0
4 0.0 0.0 0.0 1.0 11.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0
57 58 59 60 61 62 63
0 0.0 6.0 13.0 10.0 0.0 0.0 0.0
1 0.0 0.0 11.0 16.0 10.0 0.0 0.0
2 0.0 0.0 3.0 11.0 16.0 9.0 0.0
3 0.0 7.0 13.0 13.0 9.0 0.0 0.0
4 0.0 0.0 2.0 16.0 4.0 0.0 0.0
[5 rows x 64 columns]
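Each row of digits.data is a flattened 8x8 grayscale image, which is why the DataFrame has 64 columns. A single sample can be reshaped and displayed (an illustrative sketch, not in the original notebook):

import matplotlib.pyplot as plt
plt.imshow(digits.data[0].reshape(8, 8), cmap='gray_r')  # first sample as an 8x8 image
plt.title('label: %d' % digits.target[0])
plt.show()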
%matplotlib inline
from matplotlib import pyplot as plt
from sklearn.datasets import make_regression
X, y = make_regression(n_samples=100, n_features=1, noise=5.4)
plt.scatter(X, y)
<matplotlib.collections.PathCollection at 0x79725ef324a0>
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=500, centers=3)  # generate isotropic Gaussian blobs for clustering
rgb = np.array(['r', 'g', 'b'])
plt.scatter(X[:, 0], X[:, 1], color=rgb[y])
<matplotlib.collections.PathCollection at 0x79725ea145b0>
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_circles
X, y = make_circles(n_samples=100, noise=0.09)
rgb = np.array(['r', 'g', 'b'])
plt.scatter(X[:, 0], X[:, 1], color=rgb[y])
<matplotlib.collections.PathCollection at 0x79725ed46c20>
%matplotlib inline
import matplotlib.pyplot as plt
# represents the heights of a group of people in meters
heights = [[1.6], [1.65], [1.7], [1.73], [1.8]]
# represents the weights of a group of people in kgs
weights = [[60], [65], [72.3], [75], [80]]
plt.title('Height vs Weight')
plt.xlabel('Height in meter')
plt.ylabel('Weight in KG')
plt.plot(heights, weights, 'k.')
# axis range for x and y
plt.axis([1.5, 1.85, 50, 90])
plt.grid(True)
https://colab.research.google.com/drive/1uNkwUQpvKokHfP02z0dmKYV0v9A2kqyn#scrollTo=3IpUfelqgptA&printMode=true 15/29
8/16/24, 7:49 PM Chap5_wei.ipynb - Colab
https://colab.research.google.com/drive/1uNkwUQpvKokHfP02z0dmKYV0v9A2kqyn#scrollTo=3IpUfelqgptA&printMode=true 16/29
8/16/24, 7:49 PM Chap5_wei.ipynb - Colab
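The cell that fits the linear regression model did not survive the export; a minimal reconstruction consistent with the output below:

from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(heights, weights)  # fit a straight line to the five data points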
LinearRegression()
# make prediction
weight = model.predict([[1.75]])
print(weight)
[[76.03876501]]
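The fitted line is then plotted over the data points (the plotting cell was lost in the export; a sketch):

plt.plot(heights, weights, 'k.')           # original data points
plt.plot(heights, model.predict(heights))  # fitted regression line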
[<matplotlib.lines.Line2D at 0x79725edb29b0>]
model.predict([[0]])
array([[-104.75454545]])
round(model.predict([[0]])[0][0],2)
-104.75
print(round(model.intercept_[0],2))
-104.75
print(round(model.coef_[0][0],2))
103.31
import numpy as np
print('Residual sum of squares: %.2f' %
np.sum((weights-model.predict(heights))**2))
# test data
heights_test = [[1.58], [1.62], [1.69], [1.76], [1.82]]
weights_test = [[58], [63], [72], [73], [85]]
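The evaluation cell that produced the numbers below was dropped in the export; this reconstruction is consistent with the printed values:

# total and residual sums of squares on the test set
weights_test_arr = np.array(weights_test)
TSS = np.sum((weights_test_arr - weights_test_arr.mean())**2)
RSS = np.sum((weights_test_arr - model.predict(heights_test))**2)
print('TSS: %.2f' % TSS)
print('RSS: %.2f' % RSS)
print('R-squared: %.2f' % (1 - RSS/TSS))
print('R-squared: %.4f' % model.score(heights_test, weights_test))  # same value via score()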
TSS: 430.80
RSS: 24.62
R-squared: 0.94
R-squared: 0.9429
import pickle
# save the model to disk
filename = 'HeightsAndWeights_model.sav'
# write to the file using write and binary mode
pickle.dump(model, open(filename, 'wb'))
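Reading the model back from disk and scoring it on the test set (the reload cell is missing from the export; a sketch):

# load the model from disk and evaluate it on the test data
loaded_model = pickle.load(open(filename, 'rb'))
result = loaded_model.score(heights_test, weights_test)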
result
0.9428592885995254
import pandas as pd
df = pd.read_csv('NaNDataset.csv')
df
A B C
0 1 2.0 3
1 4 NaN 6
2 7 NaN 9
3 10 11.0 12
4 13 14.0 15
5 16 17.0 18
df.isnull().sum()
A 0
B 2
C 0
dtype: int64
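The two NaN values in column B are then replaced with the mean of the column's non-missing values, 11.0 (the replacement cell was dropped in the export; a reconstruction consistent with the output):

df.B = df.B.fillna(df.B.mean())  # fill NaNs with the column mean
df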
A B C
0 1 2.0 3
1 4 11.0 6
2 7 11.0 9
3 10 11.0 12
4 13 14.0 15
5 16 17.0 18
df = pd.read_csv('NaNDataset.csv')
df
A B C
0 1 2.0 3
1 4 NaN 6
2 7 NaN 9
3 10 11.0 12
4 13 14.0 15
5 16 17.0 18
df = df.dropna()  # drop rows containing NaN
df
A B C
0 1 2.0 3
3 10 11.0 12
4 13 14.0 15
5 16 17.0 18
df = df.reset_index(drop=True)  # renumber the remaining rows
df
A B C
0 1 2.0 3
1 10 11.0 12
2 13 14.0 15
3 16 17.0 18
df = pd.read_csv('DuplicateRows.csv')
df
A B C
0 1 2 3
1 4 5 6
2 4 5 6
3 7 8 9
4 7 18 9
5 10 11 12
6 10 11 12
7 13 14 15
8 16 17 18
df.duplicated()
0 False
1 False
2 True
3 False
4 False
5 False
6 True
7 False
8 False
dtype: bool
df.duplicated(keep=False)
0 False
1 True
2 True
3 False
4 False
5 True
6 True
7 False
8 False
dtype: bool
df[df.duplicated(keep=False)]
A B C
1 4 5 6
2 4 5 6
5 10 11 12
6 10 11 12
df.drop_duplicates(keep='first', inplace=True)  # remove duplicates and keep the first occurrence
df
    A   B   C
0   1   2   3
1   4   5   6
3   7   8   9
4   7  18   9
5  10  11  12
7  13  14  15
8  16  17  18