17.11.24 - Jupyter Notebook - Doc

The document provides a series of Jupyter Notebook code snippets demonstrating various data preprocessing techniques on the Pima Indians Diabetes dataset. Techniques include rescaling, standardizing, normalizing, binarizing, and feature extraction using methods like Chi-squared, RFE, and PCA. Each section includes code for loading the dataset, processing the data, and summarizing the results.


17.11 - Jupyter Notebook (http://localhost:8888/notebooks/17.11.2024_Jupiter_Notebook_doc)

In [4]: # Rescale data (between 0 and 1)
from pandas import read_csv
from numpy import set_printoptions
from sklearn.preprocessing import MinMaxScaler

filename = 'D:\\Dataset\\pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values

# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]

scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(X)

# summarize transformed data
set_printoptions(precision=3)
print(rescaledX[0:5,:])

[[0.353 0.744 0.59  0.354 0.    0.501 0.234 0.483]
 [0.059 0.427 0.541 0.293 0.    0.396 0.117 0.167]
 [0.471 0.92  0.525 0.    0.    0.347 0.254 0.183]
 [0.059 0.447 0.541 0.232 0.111 0.419 0.038 0.   ]
 [0.    0.688 0.328 0.354 0.199 0.642 0.944 0.2  ]]
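The rescaled output above can be sanity-checked against the MinMaxScaler formula: each column is mapped linearly to [0, 1] via (x - min) / (max - min). A minimal sketch on a small synthetic array (not the Pima data):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# three samples, two features with different ranges
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 40.0]])
scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)

# every column now spans exactly [0, 1]
print(scaled)
```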

In [7]: # Standardize data (0 mean, 1 stdev)
from pandas import read_csv
from numpy import set_printoptions
from sklearn.preprocessing import StandardScaler

filename = 'D:\\Dataset\\pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values

# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]

scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)

# summarize transformed data
set_printoptions(precision=3)
print(rescaledX[0:5,:])

[[ 0.64   0.848  0.15   0.907 -0.693  0.204  0.468  1.426]
 [-0.845 -1.123 -0.161  0.531 -0.693 -0.684 -0.365 -0.191]
 [ 1.234  1.944 -0.264 -1.288 -0.693 -1.103  0.604 -0.106]
 [-0.845 -0.998 -0.161  0.155  0.123 -0.494 -0.921 -1.042]
 [-1.142  0.504 -1.505  0.907  0.766  1.41   5.485 -0.02 ]]
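StandardScaler gives every column zero mean and unit (population) standard deviation, which the output above reflects. A minimal check on synthetic data (not the Pima CSV):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(loc=5.0, scale=3.0, size=(100, 4))  # arbitrary mean and spread
Z = StandardScaler().fit_transform(X)

# each column now has mean ~0 and standard deviation ~1
print(Z.mean(axis=0))
print(Z.std(axis=0))
```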

1 of 6 11/17/2024, 4:50 PM

In [8]: # Normalize data (length of 1)
from pandas import read_csv
from numpy import set_printoptions
from sklearn.preprocessing import Normalizer

filename = 'D:\\Dataset\\pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values

# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]

scaler = Normalizer().fit(X)
normalizedX = scaler.transform(X)

# summarize transformed data
set_printoptions(precision=3)
print(normalizedX[0:5,:])

[[0.034 0.828 0.403 0.196 0.    0.188 0.004 0.28 ]
 [0.008 0.716 0.556 0.244 0.    0.224 0.003 0.261]
 [0.04  0.924 0.323 0.    0.    0.118 0.003 0.162]
 [0.007 0.588 0.436 0.152 0.622 0.186 0.001 0.139]
 [0.    0.596 0.174 0.152 0.731 0.188 0.01  0.144]]
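Unlike the two scalers above, Normalizer works row-wise: each sample is divided by its L2 norm so that every row has length 1. A minimal sketch on synthetic data (not the Pima CSV):

```python
import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[3.0, 4.0],
              [6.0, 8.0]])
N = Normalizer().fit_transform(X)  # default norm='l2'

print(N)                          # both rows become [0.6, 0.8]
print(np.linalg.norm(N, axis=1))  # every row length is now 1
```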

In [10]: # Binarization
from pandas import read_csv
from numpy import set_printoptions
from sklearn.preprocessing import Binarizer

filename = 'D:\\Dataset\\pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values

# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]

binarizer = Binarizer(threshold=6.0).fit(X)
binaryX = binarizer.transform(X)

# summarize transformed data
set_printoptions(precision=3)
print(binaryX[0:5,:])

[[0. 1. 1. 1. 0. 1. 0. 1.]
[0. 1. 1. 1. 0. 1. 0. 1.]
[1. 1. 1. 0. 0. 1. 0. 1.]
[0. 1. 1. 1. 1. 1. 0. 1.]
[0. 1. 1. 1. 1. 1. 0. 1.]]
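Binarizer applies a strict cutoff: values greater than the threshold become 1, and values less than or equal to it become 0. A minimal sketch on a synthetic row (not the Pima data):

```python
import numpy as np
from sklearn.preprocessing import Binarizer

X = np.array([[5.0, 6.0, 7.0]])
B = Binarizer(threshold=6.0).fit_transform(X)

# values > 6.0 map to 1, the rest to 0 (note 6.0 itself maps to 0)
print(B)
```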


In [12]: # Feature Extraction with Univariate Statistical Tests (Chi-squared for classification)
from pandas import read_csv
from numpy import set_printoptions
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# load data
filename = 'D:\\Dataset\\pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values

X = array[:,0:8]
Y = array[:,8]

# feature extraction
test = SelectKBest(score_func=chi2, k=3)
fit = test.fit(X, Y)

# summarize scores
set_printoptions(precision=3)
print(fit.scores_)
features = fit.transform(X)

# summarize selected features
print(features[0:5,:])

[ 111.52  1411.887   17.605   53.108 2175.565  127.669    5.393  181.304]

[[148.   0.  50.]
 [ 85.   0.  31.]
 [183.   0.  32.]
 [ 89.  94.  21.]
 [137. 168.  33.]]
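The score vector above is easier to read when each score is paired with its column name; the three kept columns are simply the three highest chi2 scores (test, plas, age here). A minimal sketch of that pairing, using the bundled iris dataset instead of the Pima CSV:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

data = load_iris()
selector = SelectKBest(score_func=chi2, k=2).fit(data.data, data.target)

# pair each feature name with its chi2 score, highest first
for name, score in sorted(zip(data.feature_names, selector.scores_),
                          key=lambda p: -p[1]):
    print(f"{name}: {score:.2f}")

# get_support() marks which columns SelectKBest kept
selected = [n for n, keep in zip(data.feature_names, selector.get_support()) if keep]
print("selected:", selected)
```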


In [21]: # Feature Extraction with RFE
from pandas import read_csv
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# load data
filename = 'D:\\Dataset\\pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]

# feature extraction
rfe = RFE(estimator=LogisticRegression(), n_features_to_select=3)
fit = rfe.fit(X, Y)

print(fit.n_features_)
print(fit.support_)
print(fit.ranking_)

C:\Users\CSE\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:458: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

3
[ True False False False False True True False]
[1 2 4 5 6 1 1 3]
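The ConvergenceWarning above suggests either raising max_iter or scaling the inputs. A sketch doing both, with the scaler and RFE chained in a Pipeline, on a synthetic classification problem rather than the Pima CSV:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=8, random_state=7)
pipe = Pipeline([
    ("scale", StandardScaler()),  # scaling helps lbfgs converge
    ("rfe", RFE(estimator=LogisticRegression(max_iter=1000),
                n_features_to_select=3)),
])
pipe.fit(X, y)

rfe = pipe.named_steps["rfe"]
print(rfe.support_)   # boolean mask of the 3 kept features
print(rfe.ranking_)   # rank 1 marks a selected feature
```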


In [23]: # Feature Extraction with PCA
from pandas import read_csv
from sklearn.decomposition import PCA

# load data
filename = 'D:\\Dataset\\pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]

# feature extraction
pca = PCA(n_components=4)
fit = pca.fit(X)

# summarize components
print(fit.explained_variance_ratio_)
print(fit.components_)

[0.889]
[[-2.022e-03 9.781e-02 1.609e-02 6.076e-02 9.931e-01 1.401e-02
5.372e-04 -3.565e-03]]
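With n_components=4 the fitted PCA exposes four variance ratios and a components_ matrix of shape (4, 8); the single ratio and single row printed above look like stale output from an earlier one-component run. A minimal shape check on synthetic data (not the Pima dataset):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 8)) * np.arange(1.0, 9.0)  # columns with growing variance
fit = PCA(n_components=4).fit(X)

print(fit.explained_variance_ratio_.shape)  # one ratio per kept component
print(fit.components_.shape)                # (n_components, n_features)
```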


In [25]: # Feature Extraction with PCA on standardized data
from pandas import read_csv
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# load data
filename = 'D:\\Dataset\\pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values

X = array[:,0:8]
Y = array[:,8]

# Standardize the features
X_scaled = StandardScaler().fit_transform(X)

# Apply PCA, automatically selecting the number of components that explain 75% of the variance
pca = PCA(n_components=0.75)
X_pca = pca.fit_transform(X_scaled)

# Output: number of components and explained variance
print(f"Number of components explaining 75% of variance: {X_pca.shape[1]}")
print(f"Explained Variance (in percentage): {pca.explained_variance_ratio_ * 100}")
print(f"Cumulative Explained Variance: {sum(pca.explained_variance_ratio_) * 100:.2f}%")

# Show the transformed data (first 5 samples)
print("PCA Transformed Data (first 5 samples):\n", X_pca[:5])

Number of components explaining 75% of variance: 5
Explained Variance (in percentage): [26.18  21.64  12.87  10.944  9.529]
Cumulative Explained Variance: 81.16%
PCA Transformed Data (first 5 samples):
[[ 1.069 1.235 0.096 0.497 -0.11 ]
[-1.122 -0.734 -0.713 0.285 -0.39 ]
[-0.396 1.596 1.761 -0.07 0.906]
[-1.116 -1.271 -0.664 -0.579 -0.356]
[ 2.359 -2.185 2.963 4.033 0.593]]
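Passing a float to n_components makes PCA keep the smallest number of components whose cumulative explained variance reaches that fraction, which is why the cumulative figure above overshoots the threshold. A sketch of that behavior on synthetic correlated data (not the Pima CSV):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=(300, 4))
# eight columns built as four correlated pairs, so a few components dominate
X = np.hstack([base, base + 0.5 * rng.normal(size=(300, 4))])

X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.75).fit(X_scaled)

cum = np.cumsum(pca.explained_variance_ratio_)
print(pca.n_components_, cum)  # the last cumulative value is >= 0.75
```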


