
ML LAB 12 - Jupyter Notebook

This document analyzes wine quality data using machine learning techniques in Python. It loads the wine quality dataset, drops a highly correlated feature, imputes missing values, encodes the categorical features, and creates a new target variable flagging the "best quality" wines, preparing the data for modeling with a scaled, PCA-reduced XGBoost pipeline tuned by grid search.


ML LAB-12
In [ ]: 

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
from scipy.stats import norm
import warnings

warnings.filterwarnings('ignore')

In [ ]: 

df = pd.read_csv('winequalityN.csv')
df.head()

Out[3]:

    type  fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  ...
0  white            7.0              0.27         0.36            20.7      0.045                 45.0                 170.0   1.0010  3.00       0.45  ...
1  white            6.3              0.30         0.34             1.6      0.049                 14.0                 132.0   0.9940  3.30       0.49  ...
2  white            8.1              0.28         0.40             6.9      0.050                 30.0                  97.0   0.9951  3.26       0.44  ...
3  white            7.2              0.23         0.32             8.5      0.058                 47.0                 186.0   0.9956  3.19       0.40  ...
4  white            7.2              0.23         0.32             8.5      0.058                 47.0                 186.0   0.9956  3.19       0.40  ...

In [ ]: 

dfv = df['quality'].value_counts()
print(dfv)

6    2836
5    2138
7    1079
4     216
8     193
3      30
9       5
Name: quality, dtype: int64


In [ ]: 

for a in range(len(df.corr().columns)):
    for b in range(a):
        if abs(df.corr().iloc[a, b]) > 0.7:
            name = df.corr().columns[a]
            print(name)

total sulfur dioxide
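
The loop above recomputes df.corr() on every access. A minimal equivalent sketch (an alternative formulation, not part of the original lab) computes the matrix once and scans only its upper triangle; numeric_only=True is assumed for pandas 1.5+, where non-numeric columns are no longer dropped silently:

import numpy as np

# Absolute correlation matrix, computed once.
corr = df.corr(numeric_only=True).abs()

# Keep only the strict upper triangle so each column pair is seen once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Columns correlated above 0.7 with an earlier column.
to_drop = [col for col in upper.columns if (upper[col] > 0.7).any()]
print(to_drop)  # expected: ['total sulfur dioxide']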

In [ ]: 

new_df = df.drop('total sulfur dioxide', axis=1)

total = new_df.isnull().sum().sort_values(ascending=False)
percent = (new_df.isnull().sum() / new_df.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)

Out[6]:

                     Total   Percent
fixed acidity           10  0.001539
pH                       9  0.001385
volatile acidity         8  0.001231
sulphates                4  0.000616
citric acid              3  0.000462
chlorides                2  0.000308
residual sugar           2  0.000308
quality                  0  0.000000
alcohol                  0  0.000000
density                  0  0.000000
free sulfur dioxide      0  0.000000
type                     0  0.000000
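
The table shows small numbers of missing values in seven columns. The second half of this lab fills such gaps with per-column modes; as an alternative sketch (an assumption about approach, not part of the original lab), scikit-learn's SimpleImputer does the same in one step:

from sklearn.impute import SimpleImputer

# Median-impute the numeric columns; 'type' is excluded because the
# median strategy is defined only for numeric data.
num_cols = new_df.select_dtypes(include='number').columns
imputer = SimpleImputer(strategy='median')
new_df[num_cols] = imputer.fit_transform(new_df[num_cols])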


In [ ]: 

next_df = pd.get_dummies(new_df, drop_first=True)
next_df

Out[7]:

      fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  density    pH  sulphates  alcohol  ...
0               7.0             0.270         0.36            20.7      0.045                 45.0  1.00100  3.00       0.45      8.8  ...
1               6.3             0.300         0.34             1.6      0.049                 14.0  0.99400  3.30       0.49      9.5  ...
2               8.1             0.280         0.40             6.9      0.050                 30.0  0.99510  3.26       0.44     10.1  ...
3               7.2             0.230         0.32             8.5      0.058                 47.0  0.99560  3.19       0.40      9.9  ...
4               7.2             0.230         0.32             8.5      0.058                 47.0  0.99560  3.19       0.40      9.9  ...
...             ...               ...          ...             ...        ...                  ...      ...   ...        ...      ...  ...
6492            6.2             0.600         0.08             2.0      0.090                 32.0  0.99490  3.45       0.58     10.5  ...
6493            5.9             0.550         0.10             2.2      0.062                 39.0  0.99512  3.52        NaN     11.2  ...
6494            6.3             0.510         0.13             2.3      0.076                 29.0  0.99574  3.42       0.75     11.0  ...
6495            5.9             0.645         0.12             2.0      0.075                 32.0  0.99547  3.57       0.71     10.2  ...
6496            6.0             0.310         0.47             3.6      0.067                 18.0  0.99549  3.39       0.66     11.0  ...

6497 rows × 12 columns
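
pd.get_dummies with drop_first=True encodes the two-level 'type' column as a single type_white indicator (1 for white, 0 for red), avoiding a redundant type_red column. A toy illustration:

import pandas as pd

toy = pd.DataFrame({'type': ['white', 'red', 'white']})
print(pd.get_dummies(toy))                   # two columns: type_red, type_white
print(pd.get_dummies(toy, drop_first=True))  # one column: type_white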


In [ ]: 

next_df1 = next_df
next_df1["best quality"] = [1 if x >= 7 else 0 for x in df.quality]
print(next_df1)

      fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
0               7.0             0.270         0.36            20.7      0.045
1               6.3             0.300         0.34             1.6      0.049
2               8.1             0.280         0.40             6.9      0.050
3               7.2             0.230         0.32             8.5      0.058
4               7.2             0.230         0.32             8.5      0.058
...             ...               ...          ...             ...        ...
6492            6.2             0.600         0.08             2.0      0.090
6493            5.9             0.550         0.10             2.2      0.062
6494            6.3             0.510         0.13             2.3      0.076
6495            5.9             0.645         0.12             2.0      0.075
6496            6.0             0.310         0.47             3.6      0.067

      free sulfur dioxide  density    pH  sulphates  alcohol  quality  \
0                    45.0  1.00100  3.00       0.45      8.8        6
1                    14.0  0.99400  3.30       0.49      9.5        6
2                    30.0  0.99510  3.26       0.44     10.1        6
3                    47.0  0.99560  3.19       0.40      9.9        6
4                    47.0  0.99560  3.19       0.40      9.9        6
...                   ...      ...   ...        ...      ...      ...
6492                 32.0  0.99490  3.45       0.58     10.5        5
6493                 39.0  0.99512  3.52        NaN     11.2        6
6494                 29.0  0.99574  3.42       0.75     11.0        6
6495                 32.0  0.99547  3.57       0.71     10.2        5
6496                 18.0  0.99549  3.39       0.66     11.0        6

      type_white  best quality
0              1             0
1              1             0
2              1             0
3              1             0
4              1             0
...          ...           ...
6492           0             0
6493           0             0
6494           0             0
6495           0             0
6496           0             0

[6497 rows x 13 columns]
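
Note that next_df1 = next_df only binds a second name to the same DataFrame, so the new column is added to next_df as well. If an independent copy were intended, a minimal sketch:

next_df1 = next_df.copy()  # a real copy rather than a second name
# Vectorized equivalent of the list comprehension above.
next_df1['best quality'] = (df['quality'] >= 7).astype(int)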


In [ ]: 

features = ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
            'chlorides', 'free sulfur dioxide', 'density', 'pH', 'sulphates',
            'alcohol', 'type_white', 'best quality']

In [ ]: 

X = next_df[features]
y = next_df['quality']
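
Note that features includes 'best quality', which was derived directly from the 'quality' target above, so a model trained on this X partially sees its own label. If the goal is to predict quality from the physicochemical inputs alone, a hypothetical leakage-free variant would drop that column:

# Hypothetical variant (not in the original lab): exclude the
# target-derived indicator from the model inputs.
X = next_df[features].drop(columns=['best quality'])
y = next_df['quality']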

In [ ]: 

from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [ ]: 

from sklearn.preprocessing import MinMaxScaler


scaler = MinMaxScaler()  # renamed from norm, which shadowed scipy.stats.norm imported earlier
norm_fit = scaler.fit(X_train)  # fit on the training split only, then apply to both
new_Xtrain = norm_fit.transform(X_train)
new_Xtest = norm_fit.transform(X_test)
print(new_Xtrain)

[[0.56779661 0.13333333 0.41463415 ... 0.22580645 0. 0. ]

[0.47457627 0.05333333 0.24390244 ... 0.17741935 1. 0. ]

[0.48305085 0.52 0.21138211 ... 0.32258065 0. 0. ]

...

[0.34745763 0.16666667 0.22764228 ... 0.12903226 1. 0. ]

[0.43220339 0.12 0.26829268 ... 0.4516129 1. 0. ]

[0.34745763 0.28666667 0.27642276 ... 0.32258065 1. 0. ]]

In [ ]: 

import numpy as np
import pandas as pd

In [ ]: 

from sklearn.preprocessing import StandardScaler as ss

In [ ]: 

from sklearn.decomposition import PCA

In [ ]: 

from sklearn.model_selection import train_test_split


from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

In [ ]: 

from xgboost.sklearn import XGBClassifier


In [ ]: 

from sklearn.pipeline import Pipeline


from sklearn.pipeline import make_pipeline

In [ ]: 

from sklearn.metrics import accuracy_score


from sklearn.metrics import auc, roc_curve
from sklearn.metrics import confusion_matrix

In [ ]: 

import matplotlib.pyplot as plt


from xgboost import plot_importance
import seaborn as sns

In [ ]: 

from sklearn.model_selection import cross_val_score

In [ ]: 

from bayes_opt import BayesianOptimization

In [ ]: 

pip install bayesian-optimization

Requirement already satisfied: bayesian-optimization in c:\users\chaitanya\anaconda3n\lib\site-packages (1.2.0)
Requirement already satisfied: numpy>=1.9.0 in c:\users\chaitanya\anaconda3n\lib\site-packages (from bayesian-optimization) (1.19.2)
Requirement already satisfied: scikit-learn>=0.18.0 in c:\users\chaitanya\anaconda3n\lib\site-packages (from bayesian-optimization) (0.24.2)
Requirement already satisfied: scipy>=0.14.0 in c:\users\chaitanya\anaconda3n\lib\site-packages (from bayesian-optimization) (1.5.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in c:\users\chaitanya\anaconda3n\lib\site-packages (from scikit-learn>=0.18.0->bayesian-optimization) (2.1.0)
Requirement already satisfied: joblib>=0.11 in c:\users\chaitanya\anaconda3n\lib\site-packages (from scikit-learn>=0.18.0->bayesian-optimization) (0.17.0)
Note: you may need to restart the kernel to use updated packages.

In [ ]: 

import time
import random
from scipy.stats import uniform


In [ ]: 

pip install --upgrade pip

Requirement already satisfied: pip in c:\users\chaitanya\anaconda3n\lib\site-packages (21.3.1)
Note: you may need to restart the kernel to use updated packages.

In [ ]: 

data = pd.read_csv(
    "winequalityN.csv",
    header = 0,
    skiprows = lambda i: (i > 0) and (random.random() > 0.9)  # keep each data row with probability ~0.9
)
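
The skiprows callable keeps each data row with probability 0.9, so a different ~90% sample is drawn on every run. A sketch that seeds a dedicated generator for reproducibility (the seed value is an arbitrary choice):

import random
import pandas as pd

rng = random.Random(42)  # fixed seed -> the same sample on every run
data = pd.read_csv(
    "winequalityN.csv",
    header=0,
    skiprows=lambda i: (i > 0) and (rng.random() > 0.9),
)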

In [ ]: 

f"Data Shape : { data.shape }"

Out[27]:

'Data Shape : (5905, 13)'

In [ ]: 

data["quality"] = data["quality"].astype(float)
data.head(10)

Out[28]:

    type  fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  ...
0  white            6.3              0.30         0.34            1.60      0.049                 14.0                 132.0   0.9940  3.30       0.49  ...
1  white            8.1              0.28         0.40            6.90      0.050                 30.0                  97.0   0.9951  3.26       0.44  ...
2  white            7.2              0.23         0.32            8.50      0.058                 47.0                 186.0   0.9956  3.19       0.40  ...
3  white            8.1              0.28         0.40            6.90      0.050                 30.0                  97.0   0.9951  3.26       0.44  ...
4  white            6.2              0.32         0.16            7.00      0.045                 30.0                 136.0   0.9949  3.18       0.47  ...
5  white            7.0              0.27         0.36           20.70      0.045                 45.0                 170.0   1.0010  3.00       0.45  ...
6  white            6.3              0.30         0.34            1.60      0.049                 14.0                 132.0   0.9940  3.30       0.49  ...
7  white            8.1              0.22         0.43            1.50      0.044                 28.0                 129.0   0.9938  3.22       0.45  ...
8  white            8.1              0.27         0.41            1.45      0.033                 11.0                  63.0   0.9908  2.99       0.56  ...
9  white            8.6              0.23         0.40            4.20      0.035                 17.0                 109.0   0.9947  3.14       0.53  ...


In [ ]: 

df = data.isnull().any().reset_index()
df.head()

Out[29]:

              index      0
0              type  False
1     fixed acidity   True
2  volatile acidity   True
3       citric acid   True
4    residual sugar   True

In [ ]: 

na_columns = df.loc[df.iloc[:, 1] == True, "index"].tolist()


print("\033[1mColumns list:\033[0m \n\n{0}".format("\n".join(na_columns)))

Columns list:

fixed acidity
volatile acidity
citric acid
residual sugar
chlorides
pH
sulphates

In [ ]: 

values = dict(map(lambda i: (i, float(data[i].mode())), na_columns))


maxlen = max([len(i) for i in values.keys()])
print("\033[1mMode Values:\033[0m \n")
for key, value in values.items():
print(key, " " * (maxlen - len(key) + 5), "{:.2}".format(value))

Mode Values:

fixed acidity       6.8
volatile acidity    0.28
citric acid         0.3
residual sugar      2.0
chlorides           0.036
pH                  3.2
sulphates           0.5

In [ ]: 

data = data.fillna(value = values)
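
float(data[i].mode()) works here because each column has a single mode, but Series.mode() can return several values, and float() on a multi-row Series raises an error. A slightly more defensive sketch of the same imputation:

# Series.mode() returns all modal values; take the first explicitly
# instead of coercing the whole Series with float().
values = {col: data[col].mode().iloc[0] for col in na_columns}
data = data.fillna(value=values)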


In [ ]: 

sns.pairplot(data.iloc[:, 4:10])

Out[33]:

<seaborn.axisgrid.PairGrid at 0x21b8cbc8130>


In [ ]: 

X = data.drop("type", axis=1)
X.head()

Out[34]:

   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  ...
0            6.3              0.30         0.34             1.6      0.049                 14.0                 132.0   0.9940  3.30       0.49  ...
1            8.1              0.28         0.40             6.9      0.050                 30.0                  97.0   0.9951  3.26       0.44  ...
2            7.2              0.23         0.32             8.5      0.058                 47.0                 186.0   0.9956  3.19       0.40  ...
3            8.1              0.28         0.40             6.9      0.050                 30.0                  97.0   0.9951  3.26       0.44  ...
4            6.2              0.32         0.16             7.0      0.045                 30.0                 136.0   0.9949  3.18       0.47  ...

In [ ]: 

y = data["type"]
y.value_counts()

Out[35]:

white    4455
red      1450
Name: type, dtype: int64

In [ ]: 

y = y.map({'white': 1, 'red': 0})
y.value_counts()

Out[36]:

1    4455
0    1450
Name: type, dtype: int64

In [ ]: 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, shuffle=True)
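
With a roughly 3:1 white-to-red imbalance, a stratified split keeps that class ratio identical in both halves. A sketch (stratify and random_state are additions, not part of the original lab):

from sklearn.model_selection import train_test_split

# stratify=y preserves the white/red ratio in train and test;
# random_state pins the split for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, shuffle=True, stratify=y, random_state=42
)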


In [ ]: 

print("X_train :", X_train.shape)


print("X_test :", X_test.shape)
print("y_train :", y_train.shape)
print("y_test :", y_test.shape)

X_train : (4428, 12)

X_test : (1477, 12)

y_train : (4428,)

y_test : (1477,)

In [ ]: 

steps_xg = [('sts', ss()), ('pca', PCA()), ('xg', XGBClassifier(silent=False, n_jobs=1))]

pipe_xg = Pipeline(steps_xg)

In [ ]: 

parameters = {'xg__learning_rate': [0.1, 0.3, 0.5, 0.8],
              'xg__n_estimators': [50, 65, 85, 100],
              'xg__max_depth': [2, 3, 5, 7],
              'pca__n_components': [0.3, 0.5, 0.7, 0.9]}

clf = GridSearchCV(pipe_xg,
                   parameters,
                   n_jobs=1,
                   cv=5,
                   verbose=1,
                   scoring=['accuracy', 'roc_auc'],
                   refit='roc_auc')
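
The grid above spans 4 × 4 × 4 × 4 = 256 parameter combinations, each fitted 5 times under cross-validation (1,280 fits), so the search is slow with n_jobs=1. A minimal sketch of running it and reading the results, assuming the train/test split created above:

from sklearn.metrics import accuracy_score

clf.fit(X_train, y_train)

print("Best CV ROC-AUC :", clf.best_score_)   # CV score of the refit best estimator
print("Best parameters :", clf.best_params_)

# Evaluate the refit pipeline (scaler -> PCA -> XGBoost) on held-out data.
y_pred = clf.predict(X_test)
print("Test accuracy   :", accuracy_score(y_test, y_pred))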
