Data Pre Processing 1
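Distance from the mean, measured in standard deviations, is another common way to detect extreme values.
But it can be problematic:
It assumes the data are normally distributed
It is sensitive to very extreme values, which inflate both the mean and the standard deviation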
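The second caveat can be seen on a toy example. The sketch below uses made-up ages (not the Pima data) and a hypothetical helper, `std_outliers`, that applies the same 2.75 standard deviation rule. Once a single very extreme value is appended, the standard deviation grows so much that the previously flagged value is no longer detected.

```python
import numpy as np

def std_outliers(values, k=2.75):
    """Return the values lying more than k sample standard deviations from the mean."""
    values = np.asarray(values, dtype=float)
    deviation = np.abs(values - values.mean())
    # ddof=1 matches the default of pandas' Series.std()
    return values[deviation > k * values.std(ddof=1)]

# Hypothetical ages: 21..39 plus one clearly unusual value (80)
ages = np.append(np.arange(21, 40), 80)
flagged = std_outliers(ages)
print("flagged without the extreme value:", flagged)

# Add one very extreme value; it inflates the mean and the standard deviation
ages_with_extreme = np.append(ages, 1000)
flagged_with_extreme = std_outliers(ages_with_extreme)
print("flagged with the extreme value:   ", flagged_with_extreme)
```

In the second call only 1000 is flagged and 80 is masked, which is why the rule is often applied with care or after removing obviously invalid entries.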
import numpy as np
X = pima_df.loc[:,'Pregnancies':'Age'].copy()
# Flag Age values more than 2.75 standard deviations from the mean
outlier_df = X['Age'][((X['Age']-X['Age'].mean()).abs() > 2.75*X['Age'].std())]
print(outlier_df)
out_indices = outlier_df.index
print(out_indices)
123 69
221 66
363 67
453 72
459 81
489 67
495 66
537 67
552 66
666 70
674 68
684 69
759 66
Name: Age, dtype: int64
Int64Index([123, 221, 363, 453, 459, 489, 495, 537, 552, 666, 674, 684, 759],
dtype='int64')
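# Mark only the outlying Age values as missing (not the whole row)
X.loc[out_indices, 'Age'] = np.nan
print(X)
# Replace them with the median of the remaining Age values
mid = X['Age'].median()
X.loc[out_indices, 'Age'] = mid
print(X)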
X.plot(kind='box',subplots=True,layout=(3,3),figsize=(15,10))
plt.show()
(ggplot(pima_df, aes(x='Age', y='Glucose', colour='BloodPressure'))
 + geom_point() + stat_smooth() + facet_wrap('~Outcome'))
<ggplot: (294288713)>
(ggplot(pima_df, aes(x='Age', y='Pregnancies'))
 + geom_point(aes(color='BMI')) + facet_wrap('~Outcome') + stat_smooth())
<ggplot: (294281529)>
<matplotlib.axes._subplots.AxesSubplot at 0x1c1e1a6510>
We can observe that there are correlations between some columns:
Age is highly correlated with Pregnancies
Insulin is correlated with skin thickness and Glucose
Skin thickness is correlated with BMI
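The cell that produced the correlation plot is not shown here. As a minimal sketch of how such pairwise correlations can be listed, the example below builds a small synthetic frame (`toy_df`, hypothetical data that only borrows three column names) and ranks column pairs by the absolute value of their Pearson correlation using `DataFrame.corr()`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: Pregnancies is constructed to rise with Age, BMI is independent
rng = np.random.default_rng(7)
n = 300
age = rng.integers(21, 70, size=n)
toy_df = pd.DataFrame({
    'Age': age,
    'Pregnancies': np.clip((age - 21) // 6 + rng.integers(-1, 2, size=n), 0, None),
    'BMI': rng.normal(32, 6, size=n),
})

# Pearson correlation between every pair of columns
corr = toy_df.corr()

# Keep each pair once (upper triangle, diagonal excluded) and rank by strength
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna()
strongest = pairs.abs().sort_values(ascending=False)
print(strongest)
```

On the real data the same calls would be applied to `pima_df` instead of `toy_df`.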
sns.lmplot(x='BMI', y = 'SkinThickness', hue = 'Outcome', data = pima_df)
<seaborn.axisgrid.FacetGrid at 0x10f8e4390>
# Visualise a pairplot using seaborn, which plots each attribute against every other attribute
sns.pairplot(pima_df[['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']])
<seaborn.axisgrid.PairGrid at 0x10fb49750>
4. Data Scaling
Many machine learning algorithms expect the scale of the input and even the output data to be equivalent.
It can help in methods that weight inputs in order to make a prediction, such as in linear regression and
logistic regression. It is practically required in methods that combine weighted inputs in complex ways such
as in artificial neural networks and deep learning.
We will discuss:
1. Normalize Data
2. Standardize Data
3. When to Normalize and Standardize
1. Normalize Data
Normalization can refer to different techniques depending on context. Here, we use normalization to refer
to rescaling an input variable to the range between 0 and 1. Normalization requires that you know the
minimum and maximum values for each attribute. This can be estimated from training data or specified
directly if you have deep knowledge of the problem domain. You can easily estimate the minimum and
maximum values for each attribute in a dataset by enumerating through the values.
Once we have estimates of the maximum and minimum allowed values for each column, we can normalize
the raw data to the range 0 to 1. The calculation to normalize a single value for a column is:
scaled value = (value - min)/(max - min)
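The formula above can be applied column by column with NumPy. This is a minimal sketch rather than the code used to produce the summary that follows: `min_max_normalize` and the toy array are hypothetical, and constant columns (where max equals min) are mapped to 0 as an assumed convention to avoid dividing by zero.

```python
import numpy as np

def min_max_normalize(data, col_min=None, col_max=None):
    """Rescale each column to [0, 1] using (value - min) / (max - min)."""
    data = np.asarray(data, dtype=float)
    # Estimate the per-column minimum and maximum unless they are supplied
    if col_min is None:
        col_min = data.min(axis=0)
    if col_max is None:
        col_max = data.max(axis=0)
    span = col_max - col_min
    # Assumed convention: a constant column has span 0, so divide by 1 instead
    span = np.where(span == 0, 1.0, span)
    return (data - col_min) / span, col_min, col_max

# Toy data: two columns on very different scales
toy = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 40.0]])
normalized, toy_min, toy_max = min_max_normalize(toy)
print(normalized)
```

Returning the minimum and maximum alongside the scaled data makes it possible to reuse the same values for data that arrives later.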
np.set_printoptions(precision=3)
array = np.array(pima_df.values)
print("== Generating data sets ==")
Normalized_attributes: range of 0 to 1
                0           1           2           3           4           5  \
count  768.000000  768.000000  768.000000  768.000000  768.000000  768.000000
mean     0.226180    0.501205    0.493930    0.240798    0.170130    0.291564
std      0.198210    0.196361    0.123432    0.095554    0.102189    0.140596
min      0.000000    0.000000    0.000000    0.000000    0.000000    0.000000
25%      0.058824    0.359677    0.408163    0.195652    0.129207    0.190184
50%      0.176471    0.470968    0.491863    0.240798    0.170130    0.290389
75%      0.352941    0.620968    0.571429    0.271739    0.170130    0.376278
max      1.000000    1.000000    1.000000    1.000000    1.000000    1.000000
                6           7
count  768.000000  768.000000
mean     0.168179    0.204015
std      0.141473    0.196004
min      0.000000    0.000000
25%      0.070773    0.050000
50%      0.125747    0.133333
75%      0.234095    0.333333
max      1.000000    1.000000
2. Standardize Data
Standardization is a rescaling technique that centers the distribution of each column on 0 and scales it
to a standard deviation of 1. Together, the mean and the standard deviation can be used to
summarize a normal distribution, also called the Gaussian distribution or bell curve. It requires that the
mean and standard deviation of the values for each column be known prior to scaling. As with normalizing
above, we can estimate these values from training data, or use domain knowledge to specify their values.
The standard deviation describes the typical spread of values around the mean. It can be calculated by
summing the squared differences between each value and the mean, dividing by the number of values
minus 1, and taking the square root of the result.
Once the mean and standard deviation are calculated, we can easily calculate the standardized value. The
calculation to standardize a single value for a column is:
standardized value = (value - mean)/stdev
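The calculation can be sketched with NumPy as follows. The helper `standardize` and the toy array are hypothetical, and `ddof=1` is used so that the standard deviation divides by the number of values minus 1, as described in the text.

```python
import numpy as np

def standardize(data, col_mean=None, col_std=None):
    """Rescale each column with (value - mean) / stdev."""
    data = np.asarray(data, dtype=float)
    # Estimate the per-column mean and sample standard deviation unless supplied
    if col_mean is None:
        col_mean = data.mean(axis=0)
    if col_std is None:
        col_std = data.std(axis=0, ddof=1)  # divide by n - 1
    return (data - col_mean) / col_std, col_mean, col_std

# Toy data: two columns on very different scales
toy = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 60.0]])
standardized, toy_mean, toy_std = standardize(toy)
print(standardized)
print("column means:", standardized.mean(axis=0))
print("column stds: ", standardized.std(axis=0, ddof=1))
```

Note that scikit-learn's `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas' `describe()` reports the sample standard deviation (`ddof=1`). With 768 rows that combination gives sqrt(768/767) ≈ 1.000652 rather than exactly 1, which would be consistent with the std row in the summary that follows.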
                  5             6             7
count  7.680000e+02  7.680000e+02  7.680000e+02
mean   3.090699e-16  2.398978e-16  1.857600e-16
std    1.000652e+00  1.000652e+00  1.000652e+00
min   -2.075119e+00 -1.189553e+00 -1.041549e+00
25%   -7.215397e-01 -6.889685e-01 -7.862862e-01
50%   -8.363615e-03 -3.001282e-01 -3.608474e-01
75%    6.029301e-01  4.662269e-01  6.602056e-01
max    5.042087e+00  5.883565e+00  4.063716e+00
3. When to Normalize and Standardize
If your data is not normally distributed, consider normalizing it prior to applying your machine learning
algorithm. It is good practice to record the minimum and maximum values for each column used in the
normalization process, in case you need to normalize new data in the future to be used with your
model.
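One way to record the minimum and maximum is to keep the fitted scaler object. The sketch below assumes scikit-learn's `MinMaxScaler` (the text does not say which tool was used) and hypothetical toy arrays: the scaler is fitted on training data only, and the stored values are then reused for new data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical training data with a single column
train = np.array([[0.0], [5.0], [10.0]])

# fit() records the per-column minimum and maximum of the training data
scaler = MinMaxScaler()
scaler.fit(train)
print("recorded min:", scaler.data_min_)
print("recorded max:", scaler.data_max_)

# New data is scaled with the recorded values, not with its own min and max
new_data = np.array([[5.0], [20.0]])
new_scaled = scaler.transform(new_data)
print(new_scaled)
```

Values outside the training range map outside [0, 1] (20.0 becomes 2.0 here), which is a useful signal that the new data differs from what the model was trained on.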
from sklearn.utils import resample
print("df_minority['class'].size", df_minority['Outcome'].size)
# Downsample majority class
df_majority_downsampled = resample(df_majority,
                                   replace=False,  # sample without replacement
                                   n_samples=df_minority['Outcome'].size,  # match minority class
                                   random_state=7)  # reproducible results
("df_minority['class'].size", 268)
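# df_downsampled is not defined in the cells shown; it is assumed here to be the downsampled
# majority rows combined with the minority rows, e.g. pd.concat([df_majority_downsampled, df_minority])
print("undersampled", df_downsampled.groupby('Outcome').size())
# Shuffle the rows, then split into attributes (first 8 columns) and label (last column)
df_downsampled=df_downsampled.sample(frac=1).reset_index(drop=True)
undersampling_attr = np.array(df_downsampled.values[:,0:8])
undersampling_label = np.array(df_downsampled.values[:,8])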
('undersampled', Outcome
0 268
1 268
dtype: int64)
oversampling_attr = oversampled_df.values[:,0:8]
oversampling_label = oversampled_df.values[:,8]
print("oversampled_df", oversampled_df.groupby('label').size())
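The cell that builds `oversampled_df` is not shown. One common way to oversample, sketched here under the assumption that it mirrors the downsampling step, is to resample the minority class with replacement until it matches the majority class. The frame `toy_df` and all `toy_` names are hypothetical.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical imbalanced data: five rows of class 0, three rows of class 1
toy_df = pd.DataFrame({
    'Glucose': [85, 90, 100, 110, 120, 150, 160, 170],
    'Outcome': [0, 0, 0, 0, 0, 1, 1, 1],
})
toy_majority = toy_df[toy_df['Outcome'] == 0]
toy_minority = toy_df[toy_df['Outcome'] == 1]

# Upsample minority class
toy_minority_upsampled = resample(toy_minority,
                                  replace=True,                 # sample with replacement
                                  n_samples=len(toy_majority),  # match majority class
                                  random_state=7)               # reproducible results

# Combine the majority rows with the upsampled minority rows
toy_oversampled = pd.concat([toy_majority, toy_minority_upsampled])
counts = toy_oversampled.groupby('Outcome').size()
print(counts)
```

Because oversampling duplicates rows, copies of the same row can land in both the training and validation folds if the resampling happens before cross-validation, which can make scores look better than they are.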
missing_attr = np.array(dataset_missing.values[:,0:8])
missing_label = np.array(dataset_missing.values[:,8])
print("=== imputing by replacing missing values with mean column values ===")
dataset_impute = dataset_cp.fillna(dataset_cp.mean())
# count the number of NaN values in each column
print(dataset_impute.isnull().sum())
impute_attr = np.array(dataset_impute.values[:,0:8])
(768, 9)
=== imputing by replacing missing values with mean column values ===
Pregnancies 0
Glucose 0
BloodPressure 0
SkinThickness 0
Insulin 0
BMI 0
DiabetesPedigreeFunction 0
Age 0
Outcome 0
dtype: int64
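The effect of mean imputation is easiest to see on a small frame. The sketch below uses hypothetical values in `toy_df` and the same `fillna(mean)` pattern as the cell above: each missing entry is replaced by the mean of the observed values in its column.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per column
toy_df = pd.DataFrame({
    'Glucose': [148.0, np.nan, 183.0, 89.0],
    'BMI': [33.6, 26.6, np.nan, 28.1],
})
print(toy_df.isnull().sum())

# mean() skips NaN by default, so each column mean uses only the observed values
toy_imputed = toy_df.fillna(toy_df.mean())
print(toy_imputed)
print(toy_imputed.isnull().sum())
```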
== addressing class imbalance under or over sampling ==
Each principal component is calculated by finding the linear combination of features that maximizes
variance, while also ensuring zero correlation with the previously calculated principal components.
from sklearn.decomposition import PCA
# Keep the first 5 principal components
pca = PCA(n_components=5)
pca.fit(diabetes_attr)
diabetes_attr_pca = pca.transform(diabetes_attr)
print("original shape: ", diabetes_attr.shape)
print("transformed shape:", diabetes_attr_pca.shape)
pca.fit(normalized_attr)
normalized_attr_pca = pca.transform(normalized_attr)
pca.fit(standardized_attr)
standardized_attr_pca = pca.transform(standardized_attr)
pca.fit(impute_attr)
impute_attr_pca = pca.transform(impute_attr)
pca.fit(missing_attr)
missing_attr_pca = pca.transform(missing_attr)
pca.fit(undersampling_attr)
undersampling_attr_pca = pca.transform(undersampling_attr)
pca.fit(oversampling_attr)
oversampling_attr_pca = pca.transform(oversampling_attr)
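The two properties described above (components ordered by variance, and zero correlation between components) can be checked directly on the transformed data. The sketch below uses synthetic data rather than the diabetes attributes; `toy_data` and the other `toy_` names are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: the second column is almost a multiple of the first
rng = np.random.default_rng(7)
base = rng.normal(size=(200, 2))
toy_data = np.column_stack([
    base[:, 0],
    2.0 * base[:, 0] + 0.1 * base[:, 1],
    base[:, 1],
    rng.normal(size=200),
])

toy_pca = PCA(n_components=3)
toy_scores = toy_pca.fit_transform(toy_data)
print("original shape:   ", toy_data.shape)
print("transformed shape:", toy_scores.shape)

# Share of the total variance captured by each component, largest first
print("explained variance ratio:", toy_pca.explained_variance_ratio_)

# Correlation between the component scores: off-diagonal entries are ~0
toy_corr = np.corrcoef(toy_scores, rowvar=False)
print(np.round(toy_corr, 6))
```

Inspecting `explained_variance_ratio_` is also a simple way to judge how many components are worth keeping.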
Evaluate Algorithms
print(" == Evaluate Some Algorithms == ")
# Split-out validation dataset
print(" == Create a Validation Dataset: Split-out validation dataset == ")
# significance tests
import scipy.stats as stats
import math
print("== Build Models: build and evaluate models, Spot Check Algorithms ==")
datasets = []
datasets.append(('diabetes_attr', diabetes_attr, label))
datasets.append(('normalized_attr', normalized_attr, label))
datasets.append(('standardized_attr', standardized_attr, label))
datasets.append(('impute_attr', impute_attr, label))
datasets.append(('missing_attr', missing_attr, missing_label))
datasets.append(('undersampling_attr', undersampling_attr, undersampling_label))
datasets.append(('oversampling_attr', oversampling_attr, oversampling_label))
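from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
models = []
models.append(('LR', LogisticRegression()))  # based on imbalanced datasets and default parameters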
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('RF', RandomForestClassifier()))
models.append(('SVM', SVC()))
# Compare Algorithms
print(" == Select Best Model, Compare Algorithms == ")
fig = plt.figure()
fig.suptitle('Algorithm Comparison for ' + dataname)
ax = fig.add_subplot(111)
plt.boxplot(results)
plt.ylabel(scoring)
ax.set_xticklabels(names)
plt.show()
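The loop that fills `results` and `names` and prints the comparison tables is not shown. The sketch below is one way such a loop could look, not the original code: it assumes 10-fold cross-validation, accuracy as the score, a paired t-test (`scipy.stats.ttest_rel`) of each model against the first model, and a 0.05 significance threshold. It runs on synthetic data from `make_classification`, and every `toy_` name is hypothetical.

```python
import scipy.stats as stats
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for one (attributes, label) dataset
toy_attr, toy_label = make_classification(n_samples=200, n_features=8, random_state=7)
toy_models = [
    ('LR', LogisticRegression(max_iter=1000)),
    ('CART', DecisionTreeClassifier(random_state=7)),
]

scoring = 'accuracy'  # assumed metric
alpha = 0.05          # assumed significance threshold
# A fixed random_state gives every model the same folds, so scores are paired
kfold = KFold(n_splits=10, shuffle=True, random_state=7)

results = []
names = []
print("algorithm,mean,std,significance,p-val")
for name, model in toy_models:
    cv_scores = cross_val_score(model, toy_attr, toy_label, cv=kfold, scoring=scoring)
    if results:
        # Paired t-test against the first (baseline) model's fold scores
        _, p_val = stats.ttest_rel(results[0], cv_scores)
    else:
        # The baseline has nothing to be compared with
        p_val = float('nan')
    significant = bool(p_val < alpha)
    results.append(cv_scores)
    names.append(name)
    print("%s: %f (%f) %s %f" % (name, cv_scores.mean(), cv_scores.std(), significant, p_val))
```

After the loop, `results` holds one array of fold scores per model, which is the shape `plt.boxplot(results)` expects.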
= normalized_attr =
algorithm,mean,std,signficance,p-val
LR: 0.765619 (0.046566) False nan
LDA: 0.766951 (0.052975) False 0.828238
KNN: 0.748701 (0.062006) False 0.235048
CART: 0.700496 (0.048400) True 0.001043
NB: 0.747386 (0.043583) False 0.132240
RF: 0.746036 (0.058189) False 0.061152
SVM: 0.770813 (0.052488) False 0.309233
== Select Best Model, Compare Algorithms ==
= standardized_attr =
algorithm,mean,std,signficance,p-val
LR: 0.770813 (0.051248) False nan
LDA: 0.766951 (0.052975) False 0.526999
KNN: 0.738278 (0.039157) True 0.030019
CART: 0.687440 (0.063132) True 0.001104
NB: 0.747386 (0.043583) False 0.061474
RF: 0.757707 (0.060612) False 0.356660
SVM: 0.753913 (0.044789) True 0.022590
== Select Best Model, Compare Algorithms ==
= impute_attr =
algorithm,mean,std,signficance,p-val
LR: 0.764320 (0.048484) False nan
LDA: 0.766951 (0.052975) False 0.675096
KNN: 0.713534 (0.064980) True 0.014497
CART: 0.696617 (0.055419) True 0.000589
NB: 0.747386 (0.043583) False 0.243960
RF: 0.764234 (0.057085) False 0.996000
SVM: 0.651025 (0.072141) True 0.000669
== Select Best Model, Compare Algorithms ==
= missing_attr =
algorithm,mean,std,signficance,p-val
LR: 0.764337 (0.047320) False nan
LDA: 0.766951 (0.052975) False 0.640480
KNN: 0.713534 (0.064980) True 0.014492
CART: 0.691353 (0.063152) True 0.002221
NB: 0.747386 (0.043583) False 0.220344
RF: 0.734330 (0.062398) True 0.037339
SVM: 0.651025 (0.072141) True 0.000570
== Select Best Model, Compare Algorithms ==
= undersampling_attr =
algorithm,mean,std,signficance,p-val
LR: 0.749895 (0.053692) False nan
LDA: 0.751747 (0.071419) False 0.897521
KNN: 0.694165 (0.071292) True 0.009508
CART: 0.663941 (0.081059) True 0.016613
NB: 0.710936 (0.081741) False 0.058574
RF: 0.720335 (0.059445) False 0.210477
SVM: 0.458910 (0.072346) True 0.000001
== Select Best Model, Compare Algorithms ==
= oversampling_attr =
algorithm,mean,std,signficance,p-val
LR: 0.755000 (0.051039) False nan
LDA: 0.749000 (0.045486) False 0.111373
KNN: 0.768000 (0.027129) False 0.481468
CART: 0.762000 (0.028213) False 0.677050
NB: 0.713000 (0.046054) True 0.001323
RF: 0.814000 (0.045869) True 0.002612
SVM: 0.713000 (0.043829) False 0.090454
== Select Best Model, Compare Algorithms ==
# significance tests
import scipy.stats as stats
import math
print("== Build Models: build and evaluate models, Spot Check Algorithms ==")
datasets = []
datasets.append(('diabetes_attr', diabetes_attr_pca, label))
datasets.append(('normalized_attr', normalized_attr_pca, label))
datasets.append(('standardized_attr', standardized_attr_pca, label))
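# Note: this entry passes impute_attr rather than impute_attr_pca, so the impute_attr
# results for this run are computed on the original (non-PCA) features
datasets.append(('impute_attr', impute_attr, label))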
datasets.append(('missing_attr', missing_attr_pca, missing_label))
datasets.append(('undersampling_attr', undersampling_attr_pca, undersampling_label))
datasets.append(('oversampling_attr', oversampling_attr_pca, oversampling_label))
models = []
models.append(('LR', LogisticRegression()))  # based on imbalanced datasets and default parameters
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('RF', RandomForestClassifier()))
models.append(('SVM', SVC()))
# Compare Algorithms
print(" == Select Best Model, Compare Algorithms == ")
fig = plt.figure()
fig.suptitle('Algorithm Comparison for ' + dataname)
ax = fig.add_subplot(111)
plt.boxplot(results)
plt.ylabel(scoring)
ax.set_xticklabels(names)
plt.show()
= normalized_attr =
algorithm,mean,std,signficance,p-val
LR: 0.763004 (0.052644) False nan
LDA: 0.759091 (0.049164) False 0.432778
KNN: 0.720010 (0.064269) True 0.001669
CART: 0.654802 (0.061094) True 0.000207
NB: 0.742208 (0.045907) False 0.140785
RF: 0.740858 (0.063580) False 0.123441
SVM: 0.772095 (0.054777) True 0.009535
== Select Best Model, Compare Algorithms ==
= standardized_attr =
algorithm,mean,std,signficance,p-val
LR: 0.748701 (0.033960) False nan
LDA: 0.742208 (0.031646) False 0.272912
KNN: 0.718763 (0.051160) True 0.033641
CART: 0.704323 (0.043981) True 0.007545
NB: 0.721343 (0.035560) True 0.035025
RF: 0.716131 (0.047187) True 0.008683
SVM: 0.733083 (0.046566) False 0.179971
== Select Best Model, Compare Algorithms ==
= impute_attr =
algorithm,mean,std,signficance,p-val
LR: 0.765636 (0.047532) False nan
LDA: 0.766951 (0.052975) False 0.820491
KNN: 0.713534 (0.064980) True 0.012597
CART: 0.687389 (0.049055) True 0.000286
NB: 0.747386 (0.043583) False 0.203854
RF: 0.751299 (0.053382) False 0.169036
SVM: 0.651025 (0.072141) True 0.000537
== Select Best Model, Compare Algorithms ==
= missing_attr =
algorithm,mean,std,signficance,p-val
LR: 0.760492 (0.049736) False nan
LDA: 0.753947 (0.051575) False 0.297575
KNN: 0.717413 (0.066119) True 0.023485
CART: 0.663995 (0.055143) True 0.000938
NB: 0.747454 (0.048375) False 0.195421
RF: 0.731716 (0.058688) False 0.065855
SVM: 0.651025 (0.072141) True 0.000815
== Select Best Model, Compare Algorithms ==
= undersampling_attr =
algorithm,mean,std,signficance,p-val
LR: 0.716387 (0.058158) False nan
LDA: 0.716282 (0.060502) False 0.983147
KNN: 0.690426 (0.067903) False 0.208882
CART: 0.673620 (0.065603) False 0.115618
NB: 0.701398 (0.058597) False 0.206335
RF: 0.686513 (0.074474) False 0.179856
SVM: 0.451468 (0.063015) True 0.000002
== Select Best Model, Compare Algorithms ==
= oversampling_attr =
algorithm,mean,std,signficance,p-val
LR: 0.711000 (0.038588) False nan
LDA: 0.717000 (0.040262) False 0.051003
KNN: 0.762000 (0.023580) True 0.000407
CART: 0.742000 (0.049960) False 0.135066
NB: 0.708000 (0.050951) False 0.802536
RF: 0.767000 (0.024920) True 0.001139
SVM: 0.684000 (0.052192) False 0.261975
== Select Best Model, Compare Algorithms ==