Chap5_wei.ipynb
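The notebook begins by loading the Iris dataset from scikit-learn. The loading cell itself was lost in the export, so this is a minimal reconstruction:

from sklearn import datasets
iris = datasets.load_iris()  # Bunch object holding data, target, DESCR, ...
print(iris)                  # dump the whole object (output truncated below)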
July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.

This is perhaps the best known database to be found in the pattern recognition
literature. Fisher's paper is a classic in the field and is referenced
frequently to this day. (See Duda & Hart, for example.) The data set contains
3 classes of 50 instances each, where each class refers to a type of iris
plant. One class is linearly separable from the other 2; the latter are NOT
linearly separable from each other.

**References**

- Fisher, R.A. "The use of multiple measurements in taxonomic problems"
  Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
  Mathematical Statistics" (John Wiley, NY, 1950).
- Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
  (Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
- Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
  Structure and Classification Rule for Recognition in Partially Exposed
  Environments". IEEE Transactions on Pattern Analysis and Machine
  Intelligence, Vol. PAMI-2, No. 1, 67-71.
- Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions
  on Information Theory, May 1972, 431-433.
- See also: 1988 MLC Proceedings, 54-64. Cheeseman et al's AUTOCLASS II
  conceptual clustering system finds 3 classes in the data.
- Many, many more ...
'feature_names': ['sepal length (cm)',
'sepal width (cm)',
'petal length (cm)',
'petal width (cm)'],
'filename': 'iris.csv',
'data_module': 'sklearn.datasets.data'}
print(iris.DESCR)
:Attribute Information:
- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm
- class:
        - Iris-Setosa
        - Iris-Versicolour
        - Iris-Virginica

:Summary Statistics:

============== ==== ==== ======= ===== ====================
                Min  Max   Mean    SD   Class Correlation
============== ==== ==== ======= ===== ====================
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
============== ==== ==== ======= ===== ====================
The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.
print(iris.data)           # features
print(iris.feature_names)  # feature names
...
[6.4 2.8 5.6 2.2]
[6.3 2.8 5.1 1.5]
[6.1 2.6 5.6 1.4]
[7.7 3. 6.1 2.3]
[6.3 3.4 5.6 2.4]
[6.4 3.1 5.5 1.8]
[6. 3. 4.8 1.8]
[6.9 3.1 5.4 2.1]
[6.7 3.1 5.6 2.4]
[6.9 3.1 5.1 2.3]
[5.8 2.7 5.1 1.9]
[6.8 3.2 5.9 2.3]
[6.7 3.3 5.7 2.5]
[6.7 3. 5.2 2.3]
[6.3 2.5 5. 1.9]
[6.5 3. 5.2 2. ]
[6.2 3.4 5.4 2.3]
[5.9 3. 5.1 1.8]]
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
print(iris.target) # Labels
print(iris.target_names)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2]
['setosa' 'versicolor' 'virginica']
import pandas as pd
df = pd.DataFrame(iris.data)
print(df.head())
0 1 2 3
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
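For readability, the feature names can be attached as column labels (an optional variation on the cell above, not in the original notebook):

df = pd.DataFrame(iris.data, columns=iris.feature_names)  # label the columns
print(df.head())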
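Next comes the Wisconsin breast cancer dataset. Its loading cell is missing from the export; a minimal reconstruction:

# data on breast cancer
breast_cancer = datasets.load_breast_cancer()
print(breast_cancer.DESCR)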
The actual linear program used to obtain the separating plane
in the 3-dimensional space is that described in:
[K. P. Bennett and O. L. Mangasarian: "Robust Linear
Programming Discrimination of Two Linearly Inseparable Sets",
Optimization Methods and Software 1, 1992, 23-34].
ftp ftp.cs.wisc.edu
cd math-prog/cpo-dataset/machine-learn/WDBC/
**References**
- W.N. Street, W.H. Wolberg and O.L. Mangasarian. Nuclear feature extraction
for breast tumor diagnosis. IS&T/SPIE 1993 International Symposium on
Electronic Imaging: Science and Technology, volume 1905, pages 861-870,
San Jose, CA, 1993.
- O.L. Mangasarian, W.N. Street and W.H. Wolberg. Breast cancer diagnosis and
prognosis via linear programming. Operations Research, 43(4), pages 570-577,
July-August 1995.
- W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Machine learning techniques
to diagnose breast cancer from fine-needle aspirates. Cancer Letters 77 (1994)
163-171.
import pandas as pd
df = pd.DataFrame(breast_cancer.data)
print(df.head())
0 1 2 3 4 5 6 7 8 \
0 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 0.2419
1 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 0.1812
2 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 0.2069
3 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 0.2597
4 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 0.1809
9 ... 20 21 22 23 24 25 26 27 \
0 0.07871 ... 25.38 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654
1 0.05667 ... 24.99 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860
2 0.05999 ... 23.57 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430
3 0.09744 ... 14.91 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575
4 0.05883 ... 22.54 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625
28 29
0 0.4601 0.11890
1 0.2750 0.08902
2 0.3613 0.08758
3 0.6638 0.17300
4 0.2364 0.07678
[5 rows x 30 columns]
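The diagnosis labels live in breast_cancer.target; the class names can be inspected as follows (an added illustrative line, not in the original notebook):

print(breast_cancer.target_names)  # the two diagnosis classes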
# data on diabetes
diabetes = datasets.load_diabetes()
print(diabetes.DESCR)
.. _diabetes_dataset:
Diabetes dataset
----------------
Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.
:Target: Column 11 is a quantitative measure of disease progression one year after baseline
:Attribute Information:
- age age in years
- sex
- bmi body mass index
- bp average blood pressure
- s1 tc, total serum cholesterol
- s2 ldl, low-density lipoproteins
- s3 hdl, high-density lipoproteins
- s4 tch, total cholesterol / HDL
- s5 ltg, possibly log of serum triglycerides level
- s6 glu, blood sugar level
Note: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times the square root of `n_samples` (i.e. the sum of squares of each column totals 1).
Source URL:
https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html
import pandas as pd
df = pd.DataFrame(diabetes.data)
print(df.head())
0 1 2 3 4 5 6 \
0 0.038076 0.050680 0.061696 0.021872 -0.044223 -0.034821 -0.043401
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163 0.074412
2 0.085299 0.050680 0.044451 -0.005670 -0.045599 -0.034194 -0.032356
3 -0.089063 -0.044642 -0.011595 -0.036656 0.012191 0.024991 -0.036038
4 0.005383 -0.044642 -0.036385 0.021872 0.003935 0.015596 0.008142
7 8 9
0 -0.002592 0.019907 -0.017646
1 -0.039493 -0.068332 -0.092204
2 -0.002592 0.002861 -0.025930
3 0.034309 0.022688 -0.009362
4 -0.002592 -0.031988 -0.046641
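The optical digits dataset is loaded the same way (the loading cell was dropped in the export; a minimal reconstruction):

digits = datasets.load_digits()
print(digits.DESCR)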
.. _digits_dataset:
This is a copy of the test set of the UCI ML hand-written digits datasets
https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits
import pandas as pd
df = pd.DataFrame(digits.data)
print(df.head())
0 1 2 3 4 5 6 7 8 9 ... 54 55 56 \
0 0.0 0.0 5.0 13.0 9.0 1.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0
1 0.0 0.0 0.0 12.0 13.0 5.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0
2 0.0 0.0 0.0 4.0 15.0 12.0 0.0 0.0 0.0 0.0 ... 5.0 0.0 0.0
3 0.0 0.0 7.0 15.0 13.0 1.0 0.0 0.0 0.0 8.0 ... 9.0 0.0 0.0
4 0.0 0.0 0.0 1.0 11.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0
57 58 59 60 61 62 63
0 0.0 6.0 13.0 10.0 0.0 0.0 0.0
1 0.0 0.0 11.0 16.0 10.0 0.0 0.0
2 0.0 0.0 3.0 11.0 16.0 9.0 0.0
3 0.0 7.0 13.0 13.0 9.0 0.0 0.0
4 0.0 0.0 2.0 16.0 4.0 0.0 0.0
[5 rows x 64 columns]
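Each row of digits.data is a flattened 8x8 grayscale image, which is why the DataFrame has 64 columns. A single sample can be reshaped and displayed (an illustrative sketch, not in the original notebook):

import matplotlib.pyplot as plt
plt.imshow(digits.data[0].reshape(8, 8), cmap='gray_r')  # first sample as an 8x8 image
plt.title('label: %d' % digits.target[0])
plt.show()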
%matplotlib inline
from matplotlib import pyplot as plt
from sklearn.datasets import make_regression
X, y = make_regression(n_samples=100, n_features=1, noise=5.4)
plt.scatter(X, y)
<matplotlib.collections.PathCollection at 0x79725ef324a0>
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=500, centers=3)  # generate isotropic Gaussian blobs for clustering
rgb = np.array(['r', 'g', 'b'])
plt.scatter(X[:, 0], X[:, 1], color=rgb[y])
<matplotlib.collections.PathCollection at 0x79725ea145b0>
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_circles
X, y = make_circles(n_samples=100, noise=0.09)
rgb = np.array(['r', 'g', 'b'])
plt.scatter(X[:, 0], X[:, 1], color=rgb[y])
<matplotlib.collections.PathCollection at 0x79725ed46c20>
%matplotlib inline
import matplotlib.pyplot as plt
# represents the heights of a group of people in meters
heights = [[1.6], [1.65], [1.7], [1.73], [1.8]]
# represents the weights of a group of people in kgs
weights = [[60], [65], [72.3], [75], [80]]
plt.title('Height vs Weight')
plt.xlabel('Height in meter')
plt.ylabel('Weight in KG')
plt.plot(heights, weights, 'k.')
# axis range for x and y
plt.axis([1.5, 1.85, 50, 90])
plt.grid(True)
https://colab.research.google.com/drive/1uNkwUQpvKokHfP02z0dmKYV0v9A2kqyn#scrollTo=3IpUfelqgptA&printMode=true 15/29
8/16/24, 7:49 PM Chap5_wei.ipynb - Colab
https://colab.research.google.com/drive/1uNkwUQpvKokHfP02z0dmKYV0v9A2kqyn#scrollTo=3IpUfelqgptA&printMode=true 16/29
8/16/24, 7:49 PM Chap5_wei.ipynb - Colab
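The cell that fits the linear regression model did not survive the export; a minimal reconstruction consistent with the output below:

from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(heights, weights)  # fit a straight line to the five data points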
LinearRegression()
# make prediction
weight = model.predict([[1.75]])
print(weight)
[[76.03876501]]
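The fitted line is then plotted over the data points (the plotting cell was lost in the export; a sketch):

plt.plot(heights, weights, 'k.')           # original data points
plt.plot(heights, model.predict(heights))  # fitted regression line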
[<matplotlib.lines.Line2D at 0x79725edb29b0>]
model.predict([[0]])
array([[-104.75454545]])
round(model.predict([[0]])[0][0],2)
-104.75
print(round(model.intercept_[0],2))
-104.75
print(round(model.coef_[0][0],2))
103.31
import numpy as np
print('Residual sum of squares: %.2f' %
np.sum((weights-model.predict(heights))**2))
# test data
heights_test = [[1.58], [1.62], [1.69], [1.76], [1.82]]
weights_test = [[58], [63], [72], [73], [85]]
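The evaluation cell that produced the numbers below was dropped in the export; this reconstruction is consistent with the printed values:

# total and residual sums of squares on the test set
weights_test_arr = np.array(weights_test)
TSS = np.sum((weights_test_arr - weights_test_arr.mean())**2)
RSS = np.sum((weights_test_arr - model.predict(heights_test))**2)
print('TSS: %.2f' % TSS)
print('RSS: %.2f' % RSS)
print('R-squared: %.2f' % (1 - RSS/TSS))
print('R-squared: %.4f' % model.score(heights_test, weights_test))  # same value via score()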
TSS: 430.80
RSS: 24.62
R-squared: 0.94
R-squared: 0.9429
import pickle
# save the model to disk
filename = 'HeightsAndWeights_model.sav'
# write to the file using write and binary mode
pickle.dump(model, open(filename, 'wb'))
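Reading the model back from disk and scoring it on the test set (the reload cell is missing from the export; a sketch):

# load the model from disk and evaluate it on the test data
loaded_model = pickle.load(open(filename, 'rb'))
result = loaded_model.score(heights_test, weights_test)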
result
0.9428592885995254
import pandas as pd
df = pd.read_csv('NaNDataset.csv')
df
A B C
0 1 2.0 3
1 4 NaN 6
2 7 NaN 9
3 10 11.0 12
4 13 14.0 15
5 16 17.0 18
df.isnull().sum()
A 0
B 2
C 0
dtype: int64
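The two NaN values in column B are then replaced with the mean of the column's non-missing values, 11.0 (the replacement cell was dropped in the export; a reconstruction consistent with the output):

df.B = df.B.fillna(df.B.mean())  # fill NaNs with the column mean
df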
A B C
0 1 2.0 3
1 4 11.0 6
2 7 11.0 9
3 10 11.0 12
4 13 14.0 15
5 16 17.0 18
df = pd.read_csv('NaNDataset.csv')
df
A B C
0 1 2.0 3
1 4 NaN 6
2 7 NaN 9
3 10 11.0 12
4 13 14.0 15
5 16 17.0 18
df = df.dropna()  # drop rows containing NaN
df
A B C
0 1 2.0 3
3 10 11.0 12
4 13 14.0 15
5 16 17.0 18
df = df.reset_index(drop=True)  # renumber the remaining rows
df
A B C
0 1 2.0 3
1 10 11.0 12
2 13 14.0 15
3 16 17.0 18
df = pd.read_csv('DuplicateRows.csv')
df
A B C
0 1 2 3
1 4 5 6
2 4 5 6
3 7 8 9
4 7 18 9
5 10 11 12
6 10 11 12
7 13 14 15
8 16 17 18
df.duplicated()
0 False
1 False
2 True
3 False
4 False
5 False
6 True
7 False
8 False
dtype: bool
df.duplicated(keep=False)
0 False
1 True
2 True
3 False
4 False
5 True
6 True
7 False
8 False
dtype: bool
df[df.duplicated(keep=False)]
A B C
1 4 5 6
2 4 5 6
5 10 11 12
6 10 11 12
df.drop_duplicates(keep='first', inplace=True)  # remove duplicates and keep the first occurrence
df
    A   B   C
0   1   2   3
1   4   5   6
3   7   8   9
4   7  18   9
5  10  11  12
7  13  14  15
8  16  17  18