Chap5_wei.ipynb - Colab

from sklearn import datasets


iris = datasets.load_iris()
iris

{..., 'DESCR': '.. _iris_dataset: ... (long description string; printed in full by the next cell)',
 'feature_names': ['sepal length (cm)',
  'sepal width (cm)',
  'petal length (cm)',
  'petal width (cm)'],
 'filename': 'iris.csv',
 'data_module': 'sklearn.datasets.data'}

print(iris.DESCR)

:Attribute Information:
    - sepal length in cm
    - sepal width in cm
    - petal length in cm
    - petal width in cm
    - class:
            - Iris-Setosa
            - Iris-Versicolour
            - Iris-Virginica

:Summary Statistics:

============== ==== ==== ======= ===== ====================
                Min  Max   Mean    SD   Class Correlation
============== ==== ==== ======= ===== ====================
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
============== ==== ==== ======= ===== ====================

:Missing Attribute Values: None


:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.

This is perhaps the best known database to be found in the
pattern recognition literature. Fisher's paper is a classic in the field and
is referenced frequently to this day. (See Duda & Hart, for example.) The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant. One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

|details-start|
**References**
|details-split|

- Fisher, R.A. "The use of multiple measurements in taxonomic problems"
  Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
  Mathematical Statistics" (John Wiley, NY, 1950).
- Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
(Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
- Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
Structure and Classification Rule for Recognition in Partially Exposed
Environments". IEEE Transactions on Pattern Analysis and Machine
Intelligence, Vol. PAMI-2, No. 1, 67-71.
- Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions
on Information Theory, May 1972, 431-433.
- See also: 1988 MLC Proceedings, 54-64. Cheeseman et al.'s AUTOCLASS II
conceptual clustering system finds 3 classes in the data.
- Many, many more ...

|details-end|
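The separability claim in the description is easy to check numerically. A minimal sketch (ours, not part of the original notebook): a single petal-length threshold cleanly splits setosa from the other two classes, while versicolor and virginica overlap.

import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
petal_length = iris.data[:, 2]  # third feature: petal length (cm)

# setosa (class 0) never overlaps the other two classes on this feature
print(petal_length[iris.target == 0].max())  # 1.9
print(petal_length[iris.target != 0].min())  # 3.0

# versicolor (1) and virginica (2) overlap, so no single threshold separates them
print(petal_length[iris.target == 1].max())  # 5.1
print(petal_length[iris.target == 2].min())  # 4.5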


print(iris.data) # Features

[[5.1 3.5 1.4 0.2]
 ...
 [6.4 2.8 5.6 2.2]
[6.3 2.8 5.1 1.5]
[6.1 2.6 5.6 1.4]
[7.7 3. 6.1 2.3]
[6.3 3.4 5.6 2.4]
[6.4 3.1 5.5 1.8]
[6. 3. 4.8 1.8]
[6.9 3.1 5.4 2.1]
[6.7 3.1 5.6 2.4]
[6.9 3.1 5.1 2.3]
[5.8 2.7 5.1 1.9]
[6.8 3.2 5.9 2.3]
[6.7 3.3 5.7 2.5]
[6.7 3. 5.2 2.3]
[6.3 2.5 5. 1.9]
[6.5 3. 5.2 2. ]
[6.2 3.4 5.4 2.3]
[5.9 3. 5.1 1.8]]

print(iris.feature_names) # Feature Names

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

print(iris.target) # Labels
print(iris.target_names)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2]
['setosa' 'versicolor' 'virginica']

import pandas as pd
df = pd.DataFrame(iris.data)
print(df.head())

0 1 2 3
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
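The columns print as 0 to 3 because only the raw array was passed to the DataFrame. A small variation (our addition, not in the original notebook) attaches the feature names so the head is self-describing:

df = pd.DataFrame(iris.data, columns=iris.feature_names)
print(df.head())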

# data on breast cancer
breast_cancer = datasets.load_breast_cancer()
print(breast_cancer.DESCR)

The actual linear program used to obtain the separating plane
in the 3-dimensional space is that described in:
[K. P. Bennett and O. L. Mangasarian: "Robust Linear
Programming Discrimination of Two Linearly Inseparable Sets",
Optimization Methods and Software 1, 1992, 23-34].

This database is also available through the UW CS ftp server:

ftp ftp.cs.wisc.edu
cd math-prog/cpo-dataset/machine-learn/WDBC/

|details-start|
**References**
|details-split|

- W.N. Street, W.H. Wolberg and O.L. Mangasarian. Nuclear feature extraction
for breast tumor diagnosis. IS&T/SPIE 1993 International Symposium on
Electronic Imaging: Science and Technology, volume 1905, pages 861-870,
San Jose, CA, 1993.
- O.L. Mangasarian, W.N. Street and W.H. Wolberg. Breast cancer diagnosis and
prognosis via linear programming. Operations Research, 43(4), pages 570-577,
July-August 1995.
- W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Machine learning techniques
to diagnose breast cancer from fine-needle aspirates. Cancer Letters 77 (1994)
163-171.

|details-end|

import pandas as pd
df = pd.DataFrame(breast_cancer.data)
print(df.head())

0 1 2 3 4 5 6 7 8 \
0 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 0.2419
1 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 0.1812
2 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 0.2069
3 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 0.2597
4 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 0.1809

9 ... 20 21 22 23 24 25 26 27 \
0 0.07871 ... 25.38 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654
1 0.05667 ... 24.99 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860
2 0.05999 ... 23.57 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430
3 0.09744 ... 14.91 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575
4 0.05883 ... 22.54 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625

28 29
0 0.4601 0.11890
1 0.2750 0.08902
2 0.3613 0.08758
3 0.6638 0.17300
4 0.2364 0.07678

[5 rows x 30 columns]

# data on diabetes
diabetes = datasets.load_diabetes()
print(diabetes.DESCR)

.. _diabetes_dataset:

Diabetes dataset
----------------

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

**Data Set Characteristics:**

:Number of Instances: 442

:Number of Attributes: First 10 columns are numeric predictive values

:Target: Column 11 is a quantitative measure of disease progression one year after baseline

:Attribute Information:
- age age in years
- sex
- bmi body mass index
- bp average blood pressure
- s1 tc, total serum cholesterol
- s2 ldl, low-density lipoproteins
- s3 hdl, high-density lipoproteins
- s4 tch, total cholesterol / HDL
- s5 ltg, possibly log of serum triglycerides level
- s6 glu, blood sugar level

Note: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times the square root of `n_samples` (i.e. the sum of squares of each column totals 1).

Source URL:
https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html

For more information see:

Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004)
"Least Angle Regression," Annals of Statistics (with discussion), 407-499.
(https://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf)
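The normalization note above can be verified directly: after mean-centering and dividing by the standard deviation times sqrt(n_samples), every feature column should average to zero and have a sum of squares of 1. A quick check (ours, not from the notebook):

import numpy as np

X = diabetes.data
print(np.allclose(X.mean(axis=0), 0))        # True: columns are mean-centered
print(np.allclose((X ** 2).sum(axis=0), 1))  # True: each column's squares sum to 1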

import pandas as pd
df = pd.DataFrame(diabetes.data)
print(df.head())

0 1 2 3 4 5 6 \
0 0.038076 0.050680 0.061696 0.021872 -0.044223 -0.034821 -0.043401
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163 0.074412
2 0.085299 0.050680 0.044451 -0.005670 -0.045599 -0.034194 -0.032356
3 -0.089063 -0.044642 -0.011595 -0.036656 0.012191 0.024991 -0.036038
4 0.005383 -0.044642 -0.036385 0.021872 0.003935 0.015596 0.008142

7 8 9
0 -0.002592 0.019907 -0.017646
1 -0.039493 -0.068332 -0.092204
2 -0.002592 0.002861 -0.025930
3 0.034309 0.022688 -0.009362
4 -0.002592 -0.031988 -0.046641


# dataset of 1797 8x8 images of hand-written digits
digits = datasets.load_digits()
print(digits.DESCR)

.. _digits_dataset:

Optical recognition of handwritten digits dataset
--------------------------------------------------

**Data Set Characteristics:**

:Number of Instances: 1797
:Number of Attributes: 64
:Attribute Information: 8x8 image of integer pixels in the range 0..16.
:Missing Attribute Values: None
:Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)
:Date: July; 1998

This is a copy of the test set of the UCI ML hand-written digits datasets
https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits

The data set contains images of hand-written digits: 10 classes where
each class refers to a digit.

Preprocessing programs made available by NIST were used to extract
normalized bitmaps of handwritten digits from a preprinted form. From a
total of 43 people, 30 contributed to the training set and different 13
to the test set. 32x32 bitmaps are divided into nonoverlapping blocks of
4x4 and the number of on pixels are counted in each block. This generates
an input matrix of 8x8 where each element is an integer in the range
0..16. This reduces dimensionality and gives invariance to small
distortions.

For info on NIST preprocessing routines, see M. D. Garris, J. L. Blue, G.
T. Candela, D. L. Dimmick, J. Geist, P. J. Grother, S. A. Janet, and C.
L. Wilson, NIST Form-Based Handprint Recognition System, NISTIR 5469,
1994.

|details-start|
**References**
|details-split|

- C. Kaynak (1995) Methods of Combining Multiple Classifiers and Their
  Applications to Handwritten Digit Recognition, MSc Thesis, Institute of
Graduate Studies in Science and Engineering, Bogazici University.
- E. Alpaydin, C. Kaynak (1998) Cascading Classifiers, Kybernetika.
- Ken Tang and Ponnuthurai N. Suganthan and Xi Yao and A. Kai Qin.
Linear dimensionality reduction using relevance weighted LDA. School of
Electrical and Electronic Engineering Nanyang Technological University.
2005.
- Claudio Gentile. A New Approximate Maximal Margin Classification
Algorithm. NIPS. 2000.

|details-end|

import pandas as pd
df = pd.DataFrame(digits.data)
print(df.head())

0 1 2 3 4 5 6 7 8 9 ... 54 55 56 \
0 0.0 0.0 5.0 13.0 9.0 1.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0
1 0.0 0.0 0.0 12.0 13.0 5.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0
2 0.0 0.0 0.0 4.0 15.0 12.0 0.0 0.0 0.0 0.0 ... 5.0 0.0 0.0
3 0.0 0.0 7.0 15.0 13.0 1.0 0.0 0.0 0.0 8.0 ... 9.0 0.0 0.0
4 0.0 0.0 0.0 1.0 11.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0

57 58 59 60 61 62 63
0 0.0 6.0 13.0 10.0 0.0 0.0 0.0
1 0.0 0.0 11.0 16.0 10.0 0.0 0.0
2 0.0 0.0 3.0 11.0 16.0 9.0 0.0
3 0.0 7.0 13.0 13.0 9.0 0.0 0.0
4 0.0 0.0 2.0 16.0 4.0 0.0 0.0

[5 rows x 64 columns]
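Since each row is a flattened 8x8 image, the data is easier to appreciate visually. A short sketch (ours) using the bunch's images attribute, which holds the same pixels in 8x8 form:

%matplotlib inline
import matplotlib.pyplot as plt

# display the first digit image together with its label
plt.imshow(digits.images[0], cmap='gray_r')
plt.title('label: %d' % digits.target[0])
plt.show()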


%matplotlib inline
from matplotlib import pyplot as plt
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=100, n_features=1, noise=5.4)
plt.scatter(X, y)

<matplotlib.collections.PathCollection at 0x79725ef324a0>


%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=500, centers=3)  # generate isotropic Gaussian blobs for clustering
rgb = np.array(['r', 'g', 'b'])
plt.scatter(X[:, 0], X[:, 1], color=rgb[y])

<matplotlib.collections.PathCollection at 0x79725ea145b0>


%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_circles
X, y = make_circles(n_samples=100, noise=0.09)
rgb = np.array(['r', 'g', 'b'])
plt.scatter(X[:, 0], X[:, 1], color=rgb[y])

<matplotlib.collections.PathCollection at 0x79725ed46c20>


%matplotlib inline
import matplotlib.pyplot as plt
# represents the heights of a group of people in meters
heights = [[1.6], [1.65], [1.7], [1.73], [1.8]]
# represents the weights of a group of people in kgs
weights = [[60], [65], [72.3], [75], [80]]
plt.title('Height vs Weight')
plt.xlabel('Height in meter')
plt.ylabel('Weight in KG')
plt.plot(heights, weights, 'k.')
# axis range for x and y
plt.axis([1.5, 1.85, 50, 90])
plt.grid(True)


from sklearn.linear_model import LinearRegression

# create and fit the model
model = LinearRegression()
model.fit(X=heights, y=weights)

LinearRegression()


# make prediction
weight = model.predict([[1.75]])
print(weight)

[[76.03876501]]

import matplotlib.pyplot as plt


# represents the heights of a group of people in meters
heights = [[1.6], [1.65], [1.7], [1.73], [1.8]]
# represents the weights of a group of people in kgs
weights = [[60], [65], [72.3], [75], [80]]
plt.title('Height vs Weight')
plt.xlabel('Height in meter')
plt.ylabel('Weight in KG')
plt.plot(heights, weights, 'k.')
# axis range for x and y
plt.axis([1.5, 1.85, 50, 90])
plt.grid(True)
# plot the regression line
plt.plot(heights, model.predict(heights), color='r')


[<matplotlib.lines.Line2D at 0x79725edb29b0>]

plt.title('Weights plotted against heights')
plt.xlabel('Heights in meters')
plt.ylabel('Weights in kilograms')
plt.plot(heights, weights, 'k.')
plt.axis([0, 1.85, -200, 200])
plt.grid(True)
# plot the regression line
extreme_heights = [[0], [1.8]]
plt.plot(extreme_heights, model.predict(extreme_heights), color='b')


[<matplotlib.lines.Line2D at 0x79725c49cc70>]

model.predict([[0]])

array([[-104.75454545]])

round(model.predict([[0]])[0][0],2)

-104.75

print(round(model.intercept_[0],2))


-104.75

print(round(model.coef_[0][0],2))

103.31
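With the intercept and coefficient printed, the earlier prediction for 1.75 m can be reproduced by hand via y = coef * x + intercept:

# reproduce model.predict([[1.75]]) from the learned parameters
print(model.coef_[0][0] * 1.75 + model.intercept_[0])  # ~76.04, matching predict() above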

import numpy as np
print('Residual sum of squares: %.2f' %
      np.sum((weights - model.predict(heights))**2))

Residual sum of squares: 5.34

# test data
heights_test = [[1.58], [1.62], [1.69], [1.76], [1.82]]
weights_test = [[58], [63], [72], [73], [85]]

# Total Sum of Squares (TSS)
weights_test_mean = np.mean(np.ravel(weights_test))
TSS = np.sum((np.ravel(weights_test) - weights_test_mean) ** 2)
print("TSS: %.2f" % TSS)
# Residual Sum of Squares (RSS)
RSS = np.sum((np.ravel(weights_test) - np.ravel(model.predict(heights_test))) ** 2)
print("RSS: %.2f" % RSS)
# R_squared
R_squared = 1 - (RSS / TSS)
print("R-squared: %.2f" % R_squared)

TSS: 430.80
RSS: 24.62
R-squared: 0.94

# using scikit-learn to calculate R-squared
print('R-squared: %.4f' % model.score(heights_test, weights_test))

R-squared: 0.9429


import pickle
# save the model to disk
filename = 'HeightsAndWeights_model.sav'
# write to the file using write and binary mode
pickle.dump(model, open(filename, 'wb'))

# load the model from disk
loaded_model = pickle.load(open(filename, 'rb'))

result = loaded_model.score(heights_test, weights_test)

result

0.9428592885995254

pip install joblib

Requirement already satisfied: joblib in /usr/local/lib/python3.10/dist-packages (1.4.2)
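joblib is installed above but never used in this notebook; the scikit-learn documentation suggests it for persisting models that hold large NumPy arrays. A minimal sketch (the filename here is ours):

import joblib

# save and reload the model with joblib instead of pickle
joblib.dump(model, 'HeightsAndWeights_model.joblib')
loaded_model = joblib.load('HeightsAndWeights_model.joblib')
print(loaded_model.score(heights_test, weights_test))  # same R-squared as before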

import pandas as pd
df = pd.read_csv('NaNDataset.csv')
df


    A     B   C
0   1   2.0   3
1   4   NaN   6
2   7   NaN   9
3  10  11.0  12
4  13  14.0  15
5  16  17.0  18


df.isnull().sum()

A    0
B    2
C    0
dtype: int64

# replace all the NaNs in column B with the average of column B
df.B = df.B.fillna(df.B.mean())
print(df)

A B C
0 1 2.0 3
1 4 11.0 6
2 7 11.0 9
3 10 11.0 12
4 13 14.0 15
5 16 17.0 18
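The same mean imputation can be done with scikit-learn's SimpleImputer, which is handy when the fill values must be learned on training data and reapplied to new data. A sketch (ours) on the same CSV:

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean')
# fit learns each column's mean; transform fills the NaNs with it
filled = imputer.fit_transform(pd.read_csv('NaNDataset.csv'))
print(filled)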


df = pd.read_csv('NaNDataset.csv')
df

    A     B   C
0   1   2.0   3
1   4   NaN   6
2   7   NaN   9
3  10  11.0  12
4  13  14.0  15
5  16  17.0  18


df = df.dropna()
df

    A     B   C
0   1   2.0   3
3  10  11.0  12
4  13  14.0  15
5  16  17.0  18



df = df.reset_index(drop=True)
df

    A     B   C
0   1   2.0   3
1  10  11.0  12
2  13  14.0  15
3  16  17.0  18


df = pd.read_csv('DuplicateRows.csv')
df

    A   B   C
0   1   2   3
1   4   5   6
2   4   5   6
3   7   8   9
4   7  18   9
5  10  11  12
6  10  11  12
7  13  14  15
8  16  17  18



df.duplicated()

0    False
1    False
2     True
3    False
4    False
5    False
6     True
7    False
8    False
dtype: bool

df.duplicated(keep=False)


0    False
1     True
2     True
3    False
4    False
5     True
6     True
7    False
8    False
dtype: bool

df[df.duplicated(keep=False)]

    A   B   C
1   4   5   6
2   4   5   6
5  10  11  12
6  10  11  12

df.drop_duplicates(keep='first', inplace=True)  # remove duplicates and keep the first
df


    A   B   C
0   1   2   3
1   4   5   6
3   7   8   9
4   7  18   9
5  10  11  12
7  13  14  15
8  16  17  18
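drop_duplicates can also key on a subset of columns. In this data, rows 3 and 4 share A=7 but differ in B, so deduplicating on column A alone (a variation we add for illustration) would drop row 4 as well:

# treat rows as duplicates whenever column A matches, keeping the first occurrence
print(df.drop_duplicates(subset=['A'], keep='first'))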
