Wine DS
1. Data Ingestion
2. EDA
3. Preprocessing
4. Model Building
5. Evaluation
Abstract: Two datasets are included, related to red and white vinho verde wine samples, from the north of
Portugal. The goal is to model wine quality based on physicochemical tests.
The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. For more details,
consult Cortez et al. (2009), cited below. Due to privacy and logistical issues, only physicochemical (input) and
sensory (output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).
Scope of Work:
1. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor
ones).
2. Outlier detection algorithms could be used to detect the few excellent or poor wines.
3. We are not sure if all input variables are relevant. So it could be interesting to test feature selection
methods.
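As a rough illustration of the third point, the sketch below ranks features with a univariate ANOVA F-test (`SelectKBest` with `f_classif`). It runs on synthetic data with hypothetical column names, not on the wine data, and `k=2` is an arbitrary choice; the F-test is only one of several possible feature selection methods.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for the wine data (assumption: NOT the real dataset).
rng = np.random.default_rng(0)
n = 300
y_demo = rng.integers(0, 3, size=n)  # three ordered classes
X_demo = pd.DataFrame({
    "informative_a": y_demo + rng.normal(0, 0.5, n),   # depends on the class
    "informative_b": -y_demo + rng.normal(0, 0.5, n),  # depends on the class
    "noise_a": rng.normal(0, 1, n),                    # unrelated to the class
    "noise_b": rng.normal(0, 1, n),                    # unrelated to the class
})

# Score each feature with a one-way ANOVA F-test and keep the two best.
selector = SelectKBest(score_func=f_classif, k=2).fit(X_demo, y_demo)
selected = list(X_demo.columns[selector.get_support()])
scores = pd.Series(selector.scores_, index=X_demo.columns).sort_values(ascending=False)

print(scores)
print("Selected:", selected)
```

On the real data, the same call would take `x_train` and `y_train` in place of the synthetic frame, fitted on the training split only.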
Attribute Information:
1. Feature columns
* fixed acidity | Continuous Data
* volatile acidity | Continuous Data
* citric acid | Continuous Data
* residual sugar | Continuous Data
* chlorides | Continuous Data
* free sulfur dioxide | Continuous Data
* total sulfur dioxide | Continuous Data
* density | Continuous Data
* pH | Continuous Data
* sulphates | Continuous Data
* alcohol | Continuous Data
2. Target column
* quality | Ordinal data (score from 3 to 8)
Citation:
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from
physicochemical properties. Decision Support Systems, Elsevier, 47(4):547-553, 2009.
1. Data Ingestion
Library Import
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix, roc_auc_score, roc_curve
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import seaborn as sns
import warnings

warnings.filterwarnings("ignore")
%matplotlib inline

# from pandas_profiling import ProfileReport
# ! pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip
Data Import
# The file is semicolon-separated and its first row holds the column names
data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv', header = 0, sep=';')
data.head()
   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  alcohol
0            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4
1            7.8              0.88         0.00             2.6      0.098                 25.0                  67.0   0.9968  3.20       0.68      9.8
2            7.8              0.76         0.04             2.3      0.092                 15.0                  54.0   0.9970  3.26       0.65      9.8
3           11.2              0.28         0.56             1.9      0.075                 17.0                  60.0   0.9980  3.16       0.58      9.8
4            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4
2. EDA
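data.columns
Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality'],
      dtype='object')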
data.info()
<class 'pandas.core.frame.DataFrame'>
Shape of dataset
data.shape
(1599, 12)
Summary of dataset
data.describe().T
                       count       mean        std      min      25%       50%        75%        max
fixed acidity         1599.0   8.319637   1.741096  4.60000   7.1000   7.90000   9.200000   15.90000
volatile acidity      1599.0   0.527821   0.179060  0.12000   0.3900   0.52000   0.640000    1.58000
citric acid           1599.0   0.270976   0.194801  0.00000   0.0900   0.26000   0.420000    1.00000
residual sugar        1599.0   2.538806   1.409928  0.90000   1.9000   2.20000   2.600000   15.50000
free sulfur dioxide   1599.0  15.874922  10.460157  1.00000   7.0000  14.00000  21.000000   72.00000
total sulfur dioxide  1599.0  46.467792  32.895324  6.00000  22.0000  38.00000  62.000000  289.00000
data.isna().sum()
fixed acidity 0
volatile acidity 0
citric acid 0
residual sugar 0
chlorides 0
free sulfur dioxide 0
total sulfur dioxide 0
density 0
pH 0
sulphates 0
alcohol 0
quality 0
dtype: int64
data.quality.unique()
# Share of each quality score, in percent
round(data.quality.value_counts()/(len(data))*100,2)
5 42.59
6 39.90
7 12.45
4 3.31
8 1.13
3 0.63
Name: quality, dtype: float64
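Observation: the target is heavily imbalanced. Qualities 5 and 6 account for about 82% of the samples, while
qualities 3 and 8 together account for under 2%.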
Univariate analysis
Numerical Columns
plt.figure(figsize=(15, 15))
plt.suptitle('Univariate Analysis of Numerical Features', fontsize=20, fontweight='bold', alpha=0.8, y=1.)
numerical_features = [col for col in data.columns if data[col].dtypes != 'O']

# One KDE plot per numerical column on a 5x3 grid
for i in range(0, len(numerical_features)):
    plt.subplot(5, 3, i+1)
    # fill=True replaces the deprecated shade=True
    sns.kdeplot(x=data[numerical_features[i]], fill=True, color='b')
    plt.xlabel(numerical_features[i])
    plt.tight_layout()
Observations:
Multivariate analysis
# Correlation between the numerical columns
plt.figure(figsize = (15,10))
corr = data.corr()
matrix = np.triu(corr)  # mask for the upper triangle, diagonal included
sns.heatmap(corr, annot=True, mask = matrix)
plt.yticks(rotation=45)
plt.show()
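A heatmap of this size can be hard to scan, so one option is to list each feature's correlation with the target, strongest first. The sketch below does this on synthetic data with made-up column names (`feat_strong`, `feat_negative`, `feat_noise`, `target`); it is not a result from the wine data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in (assumption: NOT the real wine data; names are illustrative).
rng = np.random.default_rng(1)
n = 500
feat_strong = rng.normal(10, 1, n)
feat_negative = rng.normal(0.5, 0.2, n)
feat_noise = rng.normal(0, 1, n)
target = 0.8 * feat_strong - 3.0 * feat_negative + rng.normal(0, 0.5, n)
demo = pd.DataFrame({
    "feat_strong": feat_strong,
    "feat_negative": feat_negative,
    "feat_noise": feat_noise,
    "target": target,
})

# Pearson correlation of every feature with the target column.
corr_with_target = demo.corr()["target"].drop("target")

# Order by absolute value so strong negative correlations are not buried.
order = corr_with_target.abs().sort_values(ascending=False).index
ranked = corr_with_target.reindex(order)
print(ranked)
```

With the notebook's frame, `data.corr()["quality"].drop("quality")` would play the role of `corr_with_target`. Pearson correlation only captures linear association, which is worth keeping in mind for an ordinal target.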
Observations:
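feature = data.drop(columns = 'quality')
feature.columns
Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol'],
      dtype='object')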
Next, we examine the relationship between each feature column and the target column.
Feature columns:
'fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur
dioxide', 'density', 'pH', 'sulphates', 'alcohol'
Label column:
'quality'
# Number of samples in each category of the 'quality' column
sns.countplot(x = 'quality', data= data)
feature_continous = [col for col in feature.columns if data[col].dtypes != 'O']
fig = plt.figure(figsize=(15, 50))
plt.suptitle('Box plots of continuous features by quality', fontsize = 20, y = 1)

# One box plot per feature, grouped by quality score
for i in range(0, len(feature_continous)):
    ax = plt.subplot(10, 3, i+1)
    sns.boxplot(data = data, x = 'quality', y = feature_continous[i])
    plt.tight_layout()
Preliminary Conclusions:
sns.pairplot(data)
3. Preprocessing
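Standardizing the features to a common scale matters for an SVM: kernels such as RBF depend on distances between
samples, so features with large ranges (e.g. total sulfur dioxide) would otherwise dominate. Scaling also tends to
reduce training time.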
x = data.drop(columns = 'quality')
y = data['quality']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = .2, random_state = 0)
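Because the quality classes are strongly imbalanced, a plain random split can leave the rarest scores under-represented in the test set. A minimal sketch of a stratified split follows, using synthetic labels whose class sizes are made up to loosely mimic that imbalance; passing `stratify=` to `train_test_split` is an alternative to the split above, not what the notebook does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels (assumption: class sizes are illustrative only).
rng = np.random.default_rng(0)
y_demo = np.repeat([3, 4, 5, 6, 7, 8], [10, 50, 680, 640, 200, 20])
X_demo = rng.normal(size=(len(y_demo), 4))

# stratify=y_demo keeps each class's share (almost) identical in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Compare the class shares in the training and test parts.
for label in np.unique(y_demo):
    train_share = (y_tr == label).mean()
    test_share = (y_te == label).mean()
    print(label, round(train_share, 4), round(test_share, 4))
```

Stratification needs at least two samples in every class, otherwise `train_test_split` raises an error.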
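scaler = StandardScaler()
# Fit the scaler on the training data only, then transform it
x_train_tf = scaler.fit_transform(x_train)
# Per-feature means learned from the training data
scaler.mean_
10.41399531])
# Reuse the training statistics for the test data
x_test_tf = scaler.transform(x_test)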
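To make the transformation concrete: `StandardScaler` applies z = (x - mean) / std per feature, with the mean and the population standard deviation taken from the data it was fitted on. The small check below uses a made-up matrix rather than the wine data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Small made-up matrix: two features on very different scales.
X_train_demo = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
X_test_demo = np.array([[2.0, 200.0]])

scaler_demo = StandardScaler().fit(X_train_demo)

# Same computation by hand, using the training statistics only.
mean = X_train_demo.mean(axis=0)
std = X_train_demo.std(axis=0)  # NumPy's default ddof=0 matches StandardScaler
manual_test = (X_test_demo - mean) / std

print(scaler_demo.transform(X_test_demo))
print(manual_test)
```

The test row is scaled with the training mean and standard deviation, which is why the notebook calls `transform` and not `fit_transform` on `x_test`.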
4. Model Building
from sklearn.svm import SVC
Raw SVC
model_svc = SVC()
model_svc.fit(x_train_tf, y_train)
# score() on the training set gives training accuracy, not test accuracy
print(f"Training accuracy is: {model_svc.score(x_train_tf, y_train)}")
predict_raw = model_svc.predict(x_test_tf)
# Tune the regularization strength C alone, other parameters at their defaults
model_svc_tune = SVC()
params = [{'C': [.5,.9,1,1.2,1.3,1.5]}]
clf = GridSearchCV(model_svc_tune, params, cv = 10, scoring='accuracy')
clf.fit(x_train_tf, y_train)
print(f'best value of C is {clf.best_params_}')
# Tune the kernel; degree only affects the 'poly' kernel and is ignored by the others
model_svc_tune = SVC()
params = {'kernel': ['rbf', 'linear', 'poly', 'sigmoid'],
          'degree': [2, 3, 4, 5, 6]}
clf = GridSearchCV(model_svc_tune, params, cv = 10, scoring='accuracy')
clf.fit(x_train_tf, y_train)
print(clf.best_params_)

# Tune gamma alone
model_svc_tune = SVC()
params = {'gamma': [0.8, 0.9, 1, 1.1, 1.2, 1.3]}
clf = GridSearchCV(model_svc_tune, params, cv = 10, scoring='accuracy')
clf.fit(x_train_tf, y_train)
print(clf.best_params_)
{'gamma': 1}
# Joint search over C, kernel and gamma
params = {
    'C': [.9, 1, 1.2, 1.3],
    'kernel': ['rbf', 'linear'],
    'gamma': [.9, 1, 1.1]
}

clf = GridSearchCV(model_svc_tune, params, cv = 10, scoring='accuracy')
clf.fit(x_train_tf, y_train)
print(clf.best_params_)
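The searches above run on `x_train_tf`, which was scaled with statistics from the whole training set, so inside each cross-validation fold the validation part has already influenced the scaler. One common way to avoid this is to put the scaler and the SVC in a `Pipeline` and search over that. The sketch below is an alternative under stated assumptions, not the notebook's method: it uses synthetic data from `make_classification`, 5 folds to keep it quick, and grid values borrowed from the cell above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the wine data (assumption: NOT the real dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=11, n_informative=5, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Inside GridSearchCV the scaler is refit on the training part of every fold.
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC())])

# Step-prefixed names ("svc__C") route each parameter to the SVC step.
param_grid = {"svc__C": [0.9, 1, 1.2, 1.3], "svc__gamma": [0.9, 1, 1.1]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

print(search.best_params_)
# best_estimator_ has already been refit on all of X_tr with the best parameters.
test_accuracy = search.best_estimator_.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.3f}")
```

Using `search.best_estimator_` directly also removes the need to copy the selected values into a new `SVC(...)` call by hand.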
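# Note: gamma = 1.3 lies outside the joint grid above (0.9, 1, 1.1);
# clf.best_params_ holds the values that the search actually selected
model_svc_tune = SVC(C = 1.3, kernel = 'rbf', gamma = 1.3)
model_svc_tune.fit(x_train_tf, y_train)
predict_tuned = model_svc_tune.predict(x_test_tf)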
5. Evaluation
Raw Model
print(f'Accuracy Score: {accuracy_score(y_test, predict_raw)}')
print('Classification report')
print(classification_report(y_test, predict_raw))
Classification report
Tuned Model
print(f'Accuracy Score: {accuracy_score(y_test, predict_tuned)}')
print('Classification Report')
print(classification_report(y_test, predict_tuned))
Classification Report
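With classes this imbalanced, accuracy alone can look acceptable even when the rare scores are never predicted. The sketch below shows how a confusion matrix, balanced accuracy and macro-averaged F1 expose that, using hand-made labels for a classifier that always predicts the majority class. These labels are illustrative only and are not output from the models above.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             confusion_matrix, f1_score)

# Hand-made labels (assumption: illustrative only, not model output).
y_true_demo = np.array([5, 5, 5, 5, 5, 5, 6, 6, 6, 8])
# A classifier that always predicts the majority class.
y_pred_demo = np.full_like(y_true_demo, 5)

labels = [5, 6, 8]
# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=labels)
print(cm)

acc = accuracy_score(y_true_demo, y_pred_demo)               # 6 of 10 correct
bal_acc = balanced_accuracy_score(y_true_demo, y_pred_demo)  # mean per-class recall
macro_f1 = f1_score(y_true_demo, y_pred_demo, average="macro", zero_division=0)
print(acc, bal_acc, macro_f1)
```

Here accuracy is 0.6, while balanced accuracy is about 0.33 and macro F1 is 0.25, because two of the three classes are never predicted. The same calls accept `y_test` with `predict_raw` or `predict_tuned`.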
pd.read_csv("https://raw.githubusercontent.com/srinivasav22/Graduate-Admission-Prediction/master/Admission_Predict_Ver1.1.csv")